1  SUMMARY

This Final Report summarizes the research conducted in the course of the DARPA PPAML program, which was focused on enabling the use of discriminative solvers to solve discriminative tasks specified in probabilistic programs. The research produced two complementary methods of constructive discriminative solvers, model-driven and data-driven:

•  Generative  models:  a  novel  framework  for  most  accurate  computation  of  key  statistical  elements  of 
model-driven  problems  (such  as  conditional  probability,  regression,  etc.) 

•  Discriminative  models:  a  novel  framework  for  capturing  domain  knowledge  in  the  form  of  features 
and  kernels  for  standard  data-driven  problems  (solved  in  LUPI  approaches). 

These achievements are described in this report and in 9 papers published in the course of the DARPA PPAML program.


2  INTRODUCTION 

As stated in the Summary, the focus of our project is to enable the use of discriminative solvers to solve discriminative tasks specified in probabilistic programs. We refer to the system that we research and develop as Discriminative LEarning for GENerative Tasks (DILEGENT). We achieve this by focusing on two complementary methods of constructive discriminative solvers: model-driven and data-driven ones. Conceptually, we illustrate them in Figure 1. For both methods, our goal is to create a decision rule (lower right corner in Figure 1) based on training data (lower left corner in Figure 1). In the model-driven approach, the path to that decision rule consists of two conceptual steps (building a probabilistic model and using it to create a decision rule, upper part of Figure 1), whereas in the data-driven approach, that path consists of one direct step (lower part of Figure 1). Both approaches have known advantages and disadvantages, as explained next.


[Figure 1: diagram connecting "Data" (lower left) to "Decision rule" (lower right).]

Figure 1. Model-Driven and Data-Driven Approaches.

The model-driven approach is based on the assumption that the general underlying model is specified, and the only tasks that have to be executed sequentially are (1) to estimate parameters of the model, and (2) to construct the decision function based on the completely specified model. This is a well-established approach with a variety of mathematical and programming tools available. However, the actual system structure and underlying probabilistic distributions may differ from the simplified model, and the target parameters may be difficult to estimate accurately due to insufficient data, poor diversity of models, ill-posed problems, etc. Our goal is to bring in some ideas developed in the data-driven approach to improve the performance of the model-driven approach.

The data-driven approach does not rely on specific models; instead, it is focused on finding the best decision function directly. Since it has one conceptual step instead of the two sequential steps used by the model-driven approach, it typically delivers better performance (in terms of accuracy / error rate, robustness, etc.). However, by doing that, it ignores potentially valuable domain knowledge, which justifies our other goal of bringing in some ideas developed in the model-driven approach to improve the performance of the data-driven approach.

The potential applications of both approaches enhanced in the project include improved performance of a variety of key statistical / machine learning mechanisms, such as Classification, Regression, Ensemble, Recommendation, Ranking, Missing data, Multiple conflicting decisions, Imbalanced data, etc.

The  key  enabling  technology  for  model-driven  approach  is  a  scalable  algorithm  for  solving  underlying  ill- 
posed  (unstable)  integral  equations  for  conditional  probability  by  restricting  their  solutions  to  monotonic 
functions  thus  converting  them  into  well-posed  (stable)  ones.  The  key  enabling  technology  for  data-driven 
approach  is  a  method  of  encoding  knowledge  (such  as  privileged  information,  structure  information,  etc.)  into 
additional  features  before  applying  standard  machine  learning  algorithms. 

As  a  result  of  development  and  testing  these  technologies  in  model-driven  approach,  we  implemented 
conditional  probability  estimation  techniques  that  produce  accurate  data-based  solutions  with  improved 
accuracy  by  35%  over  SoA  (standard  ensemble  methods).  Correspondingly,  in  data-driven  approach,  we 
implemented  techniques  for  encoding  model-based  information  into  features  with  improved  performance  by 
40%  over  SoA  (standard  SVM  and  neural  networks). 


3  METHODS,  ASSUMPTIONS  AND  PROCEDURES 


In  this  section,  we  present  our  results  on  both  model-driven  and  data-driven  approaches. 

3.1  Model-Driven  Approach 

Conditional probability is one of the central concepts in computational decision making. Indeed, a decision involves making a choice from a set of possible choices based on input information; for example, in an image classification problem, a label is assigned to a given image by analyzing its pixels. If a decision involves probabilistic classification, it is crucial to be able not only to map the observed data into one of the predetermined classes, but to do so with varying degrees of confidence.

Current discriminative methods are mostly developed for accurately outputting values (categories, numerical, ordinal values, etc.) as their decisions. This is because they are designed to directly minimize the risk of incorrect values (misclassification, regression value, etc.). They do achieve good performance at this task because predicting values, say 0 or 1 in classification, turns out to be a simpler task than predicting the conditional probability of a particular value. To address the need of predicting conditional probability, these methods employ an ad-hoc post-processing step where the output of the classical discriminative methods is mapped to conditional probabilities. This is both inefficient and erroneous. It is inefficient because it is a multistep process. It is erroneous because the output of the first step, the decision-making step, reduces a multidimensional input to a single-dimensional output, which means there is information loss, and there may not be sufficient information to correctly estimate the conditional probability from a single-dimensional input.

The ability to accurately and efficiently output conditional probabilities is a natural first step in advancing discriminative methods to answer queries in probabilistic programs by employing them as solvers of probabilistic programs. So, we conducted research on a novel fundamental approach to estimating conditional probabilities; within the framework of discriminative learning, these probabilities translate into probabilistic classification, i.e., the set of probabilities assigned to possible classifications of any given data point. Our approach aims at providing better estimates of conditional probabilities and is suitable for a wide range of problems in decision theory and machine learning.

In our papers [1] [2] [3], we focused on the main targets of statistical inference theory, namely estimation (from the data) of specific models of random events:

1.  conditional  probability  function; 

2.  conditional  density  function; 

3.  regression  function; 

4.  density  ratio  function. 


These models can be represented in the following manner. Let F(x) be a cumulative distribution function of random variable x. We call a non-negative function p(x) the probability density function if

$$\int_{-\infty}^{x} p(x^*)\,dx^* = F(x). \tag{1}$$


Similarly, let F(x, y) be the joint probability distribution function of variables x and y. We call a non-negative function p(x, y) the joint probability density function of two variables x and y if

$$\int_{-\infty}^{y}\int_{-\infty}^{x} p(x^*, y^*)\,dx^*\,dy^* = F(x, y). \tag{2}$$


Let p(x, y) and p(x) be probability density functions for pairs (x, y) and vectors x. Suppose that p(x) > 0. The function

$$p(y|x) = \frac{p(x, y)}{p(x)} \tag{3}$$


is called the Conditional Density Function. It defines, for any fixed $x = x_0$, the probability density function $p(y|x = x_0)$ of random value $y \in R^1$. The estimation of the conditional density function from data

$$(y_1, X_1), \ldots, (y_\ell, X_\ell) \tag{4}$$

is the most difficult problem in our list of statistical inference problems.

Along with estimation of the conditional density function, an important problem is to estimate the so-called Conditional Probability Function. Let variable y be discrete, say, $y \in \{0, 1\}$. The function defined by the ratio

$$p(y = 1|x) = \frac{p(x, y = 1)}{p(x)}, \quad p(x) > 0 \tag{5}$$

is called the Conditional Probability Function. For any given vector $x = x_0$, this function defines the probability that y is equal to one; correspondingly, $p(y = 0|x = x_0) = 1 - p(y = 1|x = x_0)$. The problem is to estimate the conditional probability function, given data (4) where $y \in \{0, 1\}$.


Estimation of the conditional density function is a difficult problem; a much easier problem is that of estimating the so-called Regression Function (conditional expectation of the variable y):

$$r(x) = \int y\,p(y|x)\,dy, \tag{6}$$

which defines the expected value $y \in R^1$ for a given vector x.

We also consider a problem that is important for applications: estimating the ratio of two probability densities. Let $p_{num}(x)$ and $p_{den}(x) > 0$ be two different density functions (subscripts num and den correspond to numerator and denominator of the density ratio). Our goal is to estimate the function

$$R(x) = \frac{p_{num}(x)}{p_{den}(x)} \tag{7}$$

given iid data

$$X_1, \ldots, X_{\ell_{den}} \tag{8}$$

distributed according to $p_{den}(x)$, and iid data

$$X'_1, \ldots, X'_{\ell_{num}} \tag{9}$$

distributed according to $p_{num}(x)$.


Next,  we  describe  our  direct  settings  for  these  four  statistical  inference  problems. 


By definition, conditional density p(y|x) is the ratio of two densities

$$p(y|x) = \frac{p(x, y)}{p(x)}, \quad p(x) > 0 \tag{10}$$

or, equivalently,

$$p(y|x)\,p(x) = p(x, y). \tag{11}$$

This expression leads to the following equivalent one:

$$\int\!\!\int \theta(y - y')\,\theta(x - x')\,f(x', y')\,dF(x')\,dy' = F(x, y), \tag{12}$$

where $f(x, y) = p(y|x)$, function F(x) is the cumulative distribution function of x and F(x, y) is the joint cumulative distribution function of x and y.

Therefore, our setting of the conditional density estimation problem is as follows:

•  Find the solution of the above integral equation in the set of nonnegative functions $f(x, y) = p(y|x)$ when the cumulative probability distribution functions F(x, y) and F(x) are unknown but iid data

$$(y_1, X_1), \ldots, (y_\ell, X_\ell) \tag{13}$$

are given.

In order to solve this problem, we use empirical estimates

$$F_\ell(x, y) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(y - y_i)\,\theta(x - X_i), \tag{14}$$

$$F_\ell(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(x - X_i) \tag{15}$$

of the unknown cumulative distribution functions F(x, y) and F(x). Therefore, we have to solve an integral equation where not only its right-hand side is defined approximately (we can only deal with $F_\ell(x, y)$ instead of F(x, y)), but also the data-based approximation

$$A_\ell f(x, y) = \int\!\!\int \theta(y - y')\,\theta(x - x')\,f(x', y')\,dy'\,dF_\ell(x') \tag{16}$$

is used instead of the exact integral operator

$$A f(x, y) = \int\!\!\int \theta(y - y')\,\theta(x - x')\,f(x', y')\,dy'\,dF(x'). \tag{17}$$

Taking into account empirical estimates $F_\ell(x, y)$ and $F_\ell(x)$, our goal is thus to find the solution of the approximately defined equation

$$\frac{1}{\ell}\sum_{i=1}^{\ell} \theta(x - X_i)\int_{-\infty}^{y} f(X_i, y')\,dy' \approx \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(y - y_i)\,\theta(x - X_i). \tag{18}$$

According to the definition of conditional probability,

$$\int_{-\infty}^{\infty} p(y|x)\,dy = 1, \quad \forall x \in X. \tag{19}$$

Therefore, the solution of our equation has to satisfy the constraint $f(x, y) \ge 0$ and the constraint

$$\int_{-\infty}^{\infty} f(x, y')\,dy' = 1, \quad \forall x \in X. \tag{20}$$


We call this setting the direct constructive setting since it is based on the direct definition of the conditional density function above and uses theoretically justified approximations $F_\ell(x, y)$ and $F_\ell(x)$ of the corresponding unknown functions. In other words, the direct constructive approach that we developed consists of replacing the unknown cumulative distribution functions with their empirical approximations

$$F_\ell(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(x - X_i), \tag{21}$$

$$F_\ell(x, y = 1) = p_\ell F_\ell(x|y = 1) = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i\,\theta(x - X_i), \tag{22}$$

where $p_\ell$ is the ratio of the number of examples with y = 1 to the total number $\ell$ of observations. These empirical approximations are then used in place of the original functions in the corresponding integral equations, and the resulting systems are solved by regularization, minimizing the discrepancy between the left-hand and right-hand sides.
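As an illustration, the empirical approximations (21) and (22) can be computed directly from a sample. The following is a minimal Python sketch for one-dimensional data; the function names and the convention $\theta(0) = 1$ are assumptions made for this example and are not taken from the report.

```python
import numpy as np

def step(u):
    """Step function theta(u): 1 where u >= 0, else 0 (the value at 0 is an assumed convention)."""
    return (np.asarray(u, dtype=float) >= 0).astype(float)

def empirical_cdf(x, sample):
    """F_l(x) = (1/l) * sum_i theta(x - X_i), as in (21), for one-dimensional data."""
    sample = np.asarray(sample, dtype=float)
    return float(step(x - sample).mean())

def empirical_joint_cdf_class1(x, sample, labels):
    """F_l(x, y=1) = (1/l) * sum_i y_i * theta(x - X_i), as in (22), with labels in {0, 1}."""
    sample = np.asarray(sample, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float((labels * step(x - sample)).mean())

# Hypothetical toy data: four observations with binary labels.
X = [0.1, 0.4, 0.5, 0.9]
y = [0, 1, 0, 1]
print(empirical_cdf(0.5, X))                  # share of observations with X_i <= 0.5
print(empirical_joint_cdf_class1(0.5, X, y))  # share with X_i <= 0.5 and y_i = 1
```

Both functions are plain averages of indicator values, which is why they converge to the true distribution functions as the sample grows.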

Therefore, one has to solve our original integral equation with approximately defined right-hand side and approximately defined operator

$$A_\ell f(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(x - X_i)\,f(X_i). \tag{23}$$

Since the probability takes values between 0 and 1, our solution has to satisfy the bounds

$$0 \le f(x) \le 1, \quad \forall x \in X. \tag{24}$$

Also,

$$\int f(x)\,dF(x) = p(y = 1), \tag{25}$$

where p(y = 1) is the probability of y = 1.
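To make the role of regularization concrete, the sketch below discretizes the operator (23) and the right-hand side (22) at the sample points and solves the resulting linear system. It is a simplified illustration under stated assumptions: one-dimensional distinct inputs, a plain ridge penalty in place of the V-matrix and kernel expansion developed in [1] [2] [3], and clipping as a crude stand-in for imposing the bounds (24) inside the optimization. All function names are hypothetical.

```python
import numpy as np

def build_system(X, y):
    """Discretize A_l f = F_l(., y=1) at the sample points (one-dimensional X)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    l = len(X)
    theta = (X[:, None] >= X[None, :]).astype(float)  # theta[j, i] = theta(X_j - X_i)
    A = theta / l        # operator (23) acting on the vector of values f(X_i)
    rhs = theta @ y / l  # empirical right-hand side F_l(X_j, y=1) from (22)
    return A, rhs

def solve_regularized(A, rhs, gamma):
    """Minimize ||A f - rhs||^2 + gamma * ||f||^2, then clip to the bounds (24)."""
    n = A.shape[1]
    f = np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ rhs)
    return np.clip(f, 0.0, 1.0)

# Hypothetical toy sample with noisy labels.
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])
A, rhs = build_system(X, y)
f_raw = np.linalg.solve(A, rhs)                # no regularization
f_reg = solve_regularized(A, rhs, gamma=0.05)  # ridge-regularized estimate
print(f_raw)
print(f_reg)
```

Without regularization, the discretized equation is solved exactly by $f(X_i) = y_i$, that is, the estimate simply reproduces the labels; this is one way to see why the approximately defined equation cannot be solved naively.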

By definition, regression is the conditional mathematical expectation

$$r(x) = \int y\,p(y|x)\,dy = \int y\,\frac{p(x, y)}{p(x)}\,dy. \tag{26}$$

This can be rewritten in the form

$$r(x)\,p(x) = \int y\,p(x, y)\,dy. \tag{27}$$

Thus we obtain the equivalent equation

$$\int \theta(x - x')\,r(x')\,dF(x') = \int\!\!\int \theta(x - x')\,y\,dF(x', y). \tag{28}$$

Therefore, the direct constructive setting of the regression estimation problem is as follows:

In a given set of functions r(x), find the solution of integral equation (28) if cumulative probability distribution functions F(x, y) and F(x) are unknown but iid data $(y_1, X_1), \ldots, (y_\ell, X_\ell)$ are given.

As before, instead of these functions, we use their empirical estimates. That is, we construct the approximation

$$A_\ell r(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(x - X_i)\,r(X_i) \tag{29}$$

instead of the actual operator, and the approximation of the right-hand side

$$F_\ell(x) = \frac{1}{\ell}\sum_{j=1}^{\ell} y_j\,\theta(x - X_j) \tag{30}$$

instead of the actual right-hand side in the above integral equation, based on the observation data

$$(y_1, X_1), \ldots, (y_\ell, X_\ell), \quad y \in R^1, \; x \in X. \tag{31}$$

Let $F_{num}(x)$ and $F_{den}(x)$ be two different cumulative distribution functions defined on $X \subset R^d$ and let $p_{num}(x)$ and $p_{den}(x)$ be the corresponding density functions. Suppose that $p_{den}(x) > 0$, $x \in X$. Consider the ratio of two densities:

$$R(x) = \frac{p_{num}(x)}{p_{den}(x)}. \tag{32}$$

The problem is to estimate the ratio R(x) when densities are unknown, but iid data

$$X_1, \ldots, X_{\ell_{den}} \sim F_{den}(x), \tag{33}$$

generated according to $F_{den}(x)$, and iid data

$$X'_1, \ldots, X'_{\ell_{num}} \sim F_{num}(x), \tag{34}$$

generated according to $F_{num}(x)$, are given.

As before, we introduce the constructive setting of this problem: solve the integral equation

$$\int \theta(x - u)\,R(u)\,dF_{den}(u) = F_{num}(x) \tag{35}$$

when cumulative distribution functions $F_{den}(x)$ and $F_{num}(x)$ are unknown, but the data drawn from these distributions are given. As before, we approximate the unknown cumulative distribution functions $F_{num}(x)$ and $F_{den}(x)$ using empirical distribution functions

$$F_{\ell_{num}}(x) = \frac{1}{\ell_{num}}\sum_{j=1}^{\ell_{num}} \theta(x - X'_j) \tag{36}$$

for $F_{num}(x)$, and

$$F_{\ell_{den}}(x) = \frac{1}{\ell_{den}}\sum_{j=1}^{\ell_{den}} \theta(x - X_j) \tag{37}$$

for $F_{den}(x)$.

Since $R(x) \ge 0$ and $\lim_{x \to \infty} F_{num}(x) = 1$, our solution has to satisfy the constraints

$$R(x) \ge 0, \quad \forall x \in X, \tag{38}$$

$$\int R(x)\,dF_{den}(x) = 1. \tag{39}$$
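The empirical counterparts of equation (35) and constraints (38), (39) can be checked numerically for a candidate ratio. The sketch below is an illustration only: the function names are hypothetical, the data are one-dimensional, and the "samples" are deterministic evenly spaced stand-ins (uniform denominator density on [0, 1] and numerator density 2x, so that R(x) = 2x), chosen so that the example is reproducible.

```python
import numpy as np

def ratio_equation_residual(R_at_den, X_den, X_num, grid):
    """Empirical form of (35) on a grid:
    (1/l_den) * sum_i theta(x - X_i) * R(X_i)  minus  F_num_l(x)."""
    X_den = np.asarray(X_den, dtype=float)
    X_num = np.asarray(X_num, dtype=float)
    grid = np.asarray(grid, dtype=float)
    theta_den = (grid[:, None] >= X_den[None, :]).astype(float)
    theta_num = (grid[:, None] >= X_num[None, :]).astype(float)
    lhs = theta_den @ np.asarray(R_at_den, dtype=float) / len(X_den)
    rhs = theta_num.mean(axis=1)  # empirical distribution function (36)
    return lhs - rhs

def satisfies_ratio_constraints(R_at_den, tol=1e-8):
    """Empirical forms of (38) and (39): R >= 0 and average of R over the denominator sample = 1."""
    R = np.asarray(R_at_den, dtype=float)
    return bool(np.all(R >= 0) and abs(R.mean() - 1.0) <= tol)

# Deterministic toy data: p_den uniform on [0, 1], p_num(x) = 2x, hence R(x) = 2x.
n = 200
u = (np.arange(n) + 0.5) / n  # evenly spaced stand-in for a uniform sample
X_den = u
X_num = np.sqrt(u)            # inverse of F_num(x) = x^2 applied to u
R_true = 2.0 * X_den
grid = np.linspace(0.0, 1.0, 21)
residual = ratio_equation_residual(R_true, X_den, X_num, grid)
print(np.max(np.abs(residual)), satisfies_ratio_constraints(R_true))
```

For the true ratio the residual is small and shrinks as the samples grow; an estimation procedure would search for values of R at the denominator sample points that keep this residual small while respecting the constraints.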


Therefore,  all  main  empirical  inference  problems  described  above  (conditional  probability,  regression,  density 
ratio)  can  be  represented  via  (multidimensional)  Fredholm  integral  equation  of  the  first  kind  with 
approximately  defined  elements.  Although  approximations  converge  to  the  true  functions,  these  problems  are 
computationally  difficult  due  to  their  ill-posed  nature.  Thus  they  require  rigorous  solutions.  Various  statistical 
methods  exist  for  solving  these  inference  problems.  Our  goal  is  to  find  general  rigorous  solutions  that  take 
into  account  all  the  available  characteristics  of  the  problems. 


We  now  present  a  general  form  for  all  statistical  inference  problems. 


Consider the multidimensional Fredholm integral equation

$$\int \theta(z - z')\,f(z')\,dF_A(z') = F_B(z), \tag{40}$$

where the kernel of the operator equation is defined by the step function $\theta(z - z')$, the cumulative distribution functions $F_A(z)$ and $F_B(z)$ are unknown but the corresponding iid data

$$Z_1, \ldots, Z_{\ell_A} \sim F_A(z) \tag{41}$$

$$Z_1, \ldots, Z_{\ell_B} \sim F_B(z) \tag{42}$$

are given. In the different inference problems, the elements $f(z)$, $F_A(z)$, $F_B(z)$ of the equation have different meanings:


•  In the problem of conditional density estimation, vector z is the pair (x, y), the solution $f(z)$ is p(y|x), the cumulative distribution function $F_A(z)$ is F(x) and the cumulative distribution function $F_B(z)$ is F(x, y).

•  In the problem of conditional probability p(y = 1|x) estimation, vector z is x, the solution $f(z)$ is p(y = 1|x), the cumulative distribution function $F_A(z)$ is F(x), the cumulative distribution function $F_B(z)$ is F(x|y = 1)p(y = 1), where p(y = 1) is the probability of class y = 1.

•  In the problem of density ratio estimation, the vector z is x, the solution $f(z)$ is $p_{num}(x)/p_{den}(x)$, the cumulative function $F_A(z)$ is $F_{den}(x)$, the cumulative function $F_B(z)$ is $F_{num}(x)$, in agreement with equation (35).

•  In the problem of regression $r(x) = \int y\,p(y|x)\,dy$ estimation, the vector z is (x, y), where y > 0, the solution $f(z)$ is $y^{-1} r(x)$, the cumulative function $F_A(z)$ is F(x), the cumulative function $F_B(z)$ is $y^{-1}\int \theta(x - x')\,y'\,dF(x', y')$.

Since statistical inference problems have the same kernel of the integral equations (i.e., the step function) and the same right-hand side (i.e., the cumulative distribution function), we can develop a common standard method for solving all inference problems.

As  a  result,  we  have  developed  [1]  [2]  [3]  the  fundamental  approach  for  generative  models  that  allows  for 
most  accurate  estimation  of  the  key  statistical  quantities.  The  approach  has  been  tested  on  a  number  of 
synthetic  examples,  consistently  delivering  performance  that  exceeds  that  of  standard  methods. 

Figure  2  illustrates  both  classical  method  and  DILEGENT  method  applied  to  the  same  problem  of  estimating 
the  conditional  probability  (true  conditional  probability  is  shown  as  blue  line,  and  its  estimate  -  as  black  line) 
based  on  one-dimensional  samples  of  two  classes  (shown  as  red  and  green  markers  around  horizontal  axis), 
consisting  of  48,  96,  192  and  384  elements.  Our  novel  approach  is  shown  in  the  right  column  in  Figure  2, 
while  the  classical  approach  is  shown  in  the  left  column. 

As Figure 2 illustrates, with the increase of training sample size (from 48 to 384) the resulting approximations converge to the true conditional probability, but our approach does so faster and more accurately than the classical approach.


In the context of SVM, the conditional probability of SVM outputs can be further analyzed in the following manner. As Platt [4] observed, the smaller the (negative) score $s_i$ for vector $x_i$, the closer the conditional probability $P(y = 1|s_i)$ is to zero, and the larger the (positive) score $s_i$, the closer the conditional probability $P(y = 1|s_i)$ is to one. Platt introduced a method for mapping SVM scores into values of conditional probability based on two hypotheses, a general one and a special one.


The general hypothesis: Conditional probability function $P(y = 1|s)$ is a monotonic function of variable s.

The special hypothesis: Conditional probability function can be approximated well with sigmoid functions with two parameters:

$$P(y = 1|s) = \frac{1}{1 + \exp\{-As + B\}}, \quad A, B \in R^1. \tag{43}$$

Using the maximum likelihood technique, [4] introduced effective methods to estimate both parameters A, B (see [5]).
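For reference, the sigmoid mapping (43) and its maximum likelihood fit can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [4] or [5]: the optimizer here is plain gradient descent, and the function names, learning rate, number of steps and toy scores are all assumptions made for the example.

```python
import numpy as np

def platt_sigmoid(s, A, B):
    """Sigmoid (43): P(y=1|s) = 1 / (1 + exp(-A*s + B))."""
    return 1.0 / (1.0 + np.exp(-A * np.asarray(s, dtype=float) + B))

def negative_log_likelihood(A, B, s, y):
    """Negative log-likelihood of labels y in {0, 1} under the sigmoid model."""
    p = np.clip(platt_sigmoid(s, A, B), 1e-12, 1.0 - 1e-12)
    y = np.asarray(y, dtype=float)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def fit_platt(s, y, lr=0.1, steps=2000):
    """Gradient descent on the negative log-likelihood (a simple stand-in optimizer)."""
    s = np.asarray(s, dtype=float)
    y = np.asarray(y, dtype=float)
    A, B = 1.0, 0.0
    for _ in range(steps):
        p = platt_sigmoid(s, A, B)
        # With z = A*s - B, the derivative of the loss with respect to z is (p - y).
        grad_A = np.sum((p - y) * s) / len(s)
        grad_B = -np.sum(p - y) / len(s)
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Hypothetical SVM scores with overlapping classes.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
A_fit, B_fit = fit_platt(scores, labels)
print(A_fit, B_fit, platt_sigmoid(0.0, A_fit, B_fit))
```

Whatever the fitted values, the result is confined to the two-parameter sigmoid family, which is the limitation that motivates the monotonic estimate discussed in the report.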

[Figure 2: two columns of plots, I-matrix (classical approach) on the left and V-matrix (new approach) on the right; the four rows correspond to training data sizes of 48, 96, 192 and 384, each split evenly between class 1 and class 2.]

Figure 2. Model-Driven and Data-Driven Approximations of Conditional Probability.

Platt's approach was shown to be useful for calibration of SVM scores. Nevertheless, this method has certain drawbacks: even if the conditional probability function for SVM is monotonically increasing, it does not necessarily have the form of a two-parameter sigmoid function. It is easy to construct examples where the suggested sigmoid function does not approximate well the desired monotonic conditional probability function.


The one-dimensional problem mentioned in the previous paragraph has the following form: given pairs (values $s_i$ of SVM scores and corresponding classifications $y_i$)

$$(s_1, y_1), \ldots, (s_\ell, y_\ell), \tag{44}$$

find an accurate approximation of the monotonic conditional probability function $P(y = 1|s)$. Further, we describe a technique for construction of a monotonic approximation of the desired function. This approximation provides a more accurate estimate than the one based on sigmoid functions. We also consider a more general (and more important) problem than this one-dimensional one. Suppose we have d different SVMs solving the same classification problem. Also, suppose that the probability of class y = 1 given scores $s = (s^1, \ldots, s^d)$ of d SVMs is a multidimensional monotonic conditional probability function: for any coordinate k and any fixed values of the other coordinates $(s^1, \ldots, s^{k-1}, s^{k+1}, \ldots, s^d)$, the higher the value of score $s^k$, the higher the probability $P(y = 1|s)$.

The goal is to find a method for estimation of the monotonic conditional probability function $P(y = 1|s)$ for multidimensional vectors $s = (s^1, \ldots, s^d)$; that is, to combine, in a single probability value, the results of multiple (namely, d) SVMs. We show that estimating the conditional probability function in a set of monotonic functions has a significant advantage over estimating it in a general, non-monotonic set of functions: it forms a well-posed problem rather than an ill-posed problem.


The decision rule for a two-class pattern recognition problem can be obtained using the estimated conditional probability function $P(y = 1|s)$ as

$$y = \theta\left(P(y = 1|s) - \frac{1}{2}\right). \tag{45}$$


It  is  important  to  note  that,  in  classical  machine  learning  literature,  there  are  ensemble  methods  that 
combine  several  rules  (see  [6],  [7],  [8]).  The  difference  between  ensemble  rules  and  synergy  rules  is  in  the 
following: 


1)  Ensemble  rule  is  a  result  of  structural  combination  (such  as  voting  or  weighted  aggregation)  of 
several  classification  rules. 

2)  Synergy  rule  defines  the  optimal  solution  to  the  problem  of  combining  several  scores  of  monotonic 
rules.  It  is  based  on  effective  methods  of  conditional  probability  estimation  in  the  set  of  monotonic 
functions. 

3)  Synergy  rule  is  constructed  only  for  monotonic  rules  (such  as  SVM)  in  contrast  to  ensemble  rule 
which  combines  any  rules.  Synergy  is  the  property  of  monotonicity  of  the  solution. 

Our goal is to estimate conditional probability in the set of monotonically increasing functions. We do this by expanding the desired function on kernels that generate splines with an infinite number of knots (INK-splines) of degree zero. The reason we use these kernels is that they enable an efficient and straightforward construction of multidimensional monotonic functions; it is possible that some other kernels might be used for that purpose as well.


According  to  the  definition  in  the  one-dimensional  case,  splines  of  degree  r  with  m  knots  are  defined  by 
the  expansion  (here,  we  assume  that  0<r<  1 ) 


$$S(x|r, m) = \sum_{s=0}^{r} c_s x^s + \sum_{k=1}^{m} e_k (x - a_k)_+^r, \tag{46}$$

where

$$(x - a_k)_+^r = \begin{cases} (x - a_k)^r & \text{if } x - a_k > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{47}$$

We generalize this representation using an infinite number of knots:

$$S_r(x) = \sum_{s=0}^{r} c_s x^s + \int_0^{\infty} g(\tau)\,(x - \tau)_+^r\,d\tau. \tag{48}$$


Following the approach from [9], [10], we define the kernel with infinite number of knots (INK-spline) of degree r for expansion of the function of one variable $x \ge 0$ in the form

$$K_r(x_i, x_j) = \int_0^{\infty} (x_i - \tau)_+^r (x_j - \tau)_+^r\,d\tau = \sum_{k=0}^{r} \frac{C_r^k}{2r - k + 1} \left(\min\{x_i, x_j\}\right)^{2r-k+1} |x_i - x_j|^k \tag{49}$$

(here we modified the definition of INK-kernel from [10] by omitting its polynomial portion).


For r = 0, the INK-spline kernel has the form

$$K_0(x_i, x_j) = \min\{x_i, x_j\}; \tag{50}$$

for r = 1, the INK-spline kernel has the form

$$K_1(x_i, x_j) = \frac{1}{3}\left(\min\{x_i, x_j\}\right)^3 + \frac{1}{2}\left(\min\{x_i, x_j\}\right)^2 |x_i - x_j|. \tag{51}$$

In the multidimensional case, the INK-spline of degree r is defined as

$$K_r(x_i, x_j) = \prod_{k=1}^{d} K_r(x_i^k, x_j^k), \quad x = (x^1, \ldots, x^d). \tag{52}$$
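The closed form (49) and its special cases (50), (51) are easy to check numerically. The following Python sketch is an illustration with hypothetical function names: it evaluates the one-dimensional kernel from the sum in (49), compares it with a midpoint-rule approximation of the integral in (49), and forms the multidimensional kernel (52) as a product over coordinates.

```python
import numpy as np
from math import comb

def ink_spline_kernel_1d(xi, xj, r):
    """One-dimensional INK-spline kernel of degree r, closed form (49), for xi, xj >= 0."""
    m = min(xi, xj)
    d = abs(xi - xj)
    return sum(comb(r, k) / (2 * r - k + 1) * m ** (2 * r - k + 1) * d ** k
               for k in range(r + 1))

def ink_spline_kernel(x, z, r):
    """Multidimensional kernel (52): product of one-dimensional kernels over coordinates."""
    return float(np.prod([ink_spline_kernel_1d(a, b, r) for a, b in zip(x, z)]))

def ink_kernel_by_integration(xi, xj, r, n=200000):
    """Midpoint-rule value of the integral in (49); the integrand is zero beyond min(xi, xj)."""
    m = min(xi, xj)
    t = (np.arange(n) + 0.5) * m / n
    return float(np.sum((xi - t) ** r * (xj - t) ** r) * m / n)

print(ink_spline_kernel_1d(0.3, 0.7, 0))       # degree zero reduces to min(xi, xj)
print(ink_spline_kernel_1d(0.3, 0.7, 1))       # closed form for degree one
print(ink_kernel_by_integration(0.3, 0.7, 1))  # numerical integral for comparison
print(ink_spline_kernel([0.3, 0.5], [0.7, 0.2], 0))
```

For degree zero the multidimensional kernel is simply the product of coordinate-wise minima, which is the form used for monotonic estimation later in this section.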

In order to find a monotonic solution, we use our method for estimating the conditional probability function with INK-spline kernel of degree zero with additional $\ell$ monotonicity constraints. That is, we have to minimize the functional

$$W(A) = (KA + b1_\ell)^T V (KA + b1_\ell) - 2(KA + b1_\ell)^T V Y + \gamma A^T K A \tag{53}$$

(here coordinates of vector Y are in $\{-1, +1\}$) subject to $\ell + 1$ inequality constraints

$$A^T \Theta(0) \ge 0, \quad A^T \Theta(x_j) \ge 0, \quad j = 1, \ldots, \ell, \tag{54}$$

where $\Theta(x) = (\theta(x_1 - x), \ldots, \theta(x_\ell - x))^T$.


Let v>0. Then, in order to construct the conditional probability in the set of non-negative monotonic functions bounded by the value 1, we have to enforce the constraint $P(y = 1|x) \le 1$. Thus, taking into account nonnegativity and monotonicity constraints, we add the constraint

$$A^T K(x = 1) + b = A^T X + b \le 1, \tag{55}$$

where $X = (x_1, \ldots, x_\ell)^T$. Using the $L_2$-norm SVM for estimating the monotonic conditional probability function, we minimize the functional

$$W(A) = A^T K K A - 2 A^T K Y + \gamma A^T K A, \tag{56}$$

with coordinates of Y in [0, 1], subject to the $\ell + 2$ inequality constraints described above. In the multidimensional case, we can assume (by proper normalization) that $x \in [0, 1]^d$. We consider the solution of the equation in the form

$$f(x) = \sum_{i=1}^{\ell} a_i K(x_i, x) + b, \tag{57}$$

where the kernel generating the d-dimensional INK-spline of degree zero has the multiplicative form

$$K(x_i, x) = \prod_{k=1}^{d} \min(x_i^k, x^k). \tag{58}$$


Along with functions defined by (multiplicative) INK-spline kernels of degree zero that can construct approximations to monotonic functions, we consider functions defined by the additive kernel (which is a sum of one-dimensional kernels)

$$f(x) = \sum_{i=1}^{\ell} \sum_{k=1}^{d} a_i^k \min(x_i^k, x^k) + b. \tag{59}$$

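In order to find the $d \times \ell$ coefficients $a_i^k$ of the expansion in estimating the conditional probability function in the direct setting, we minimize the functional

$$R(A_1, \ldots, A_d) = \left(\sum_{k=1}^{d} K_k A_k + b1_\ell\right)^T V \left(\sum_{k=1}^{d} K_k A_k + b1_\ell\right) - 2\left(\sum_{k=1}^{d} K_k A_k + b1_\ell\right)^T V Y + \gamma \sum_{k=1}^{d} A_k^T K_k A_k \tag{60}$$

subject to $d \times (\ell + 1)$ inequality constraints

$$\frac{\partial f(x_j)}{\partial x^k} = \sum_{i=1}^{\ell} a_i^k\,\theta(x_i^k - x_j^k) = A_k^T \Theta(x_j^k) \ge 0, \quad j = 1, \ldots, \ell; \; k = 1, \ldots, d,$$

$$\frac{\partial f(0)}{\partial x^k} = \sum_{i=1}^{\ell} a_i^k = A_k^T \Theta(0) \ge 0, \quad k = 1, \ldots, d, \tag{61}$$

where we have denoted by $A_k$ the $\ell$-dimensional vector $A_k = (a_1^k, \ldots, a_\ell^k)^T$, $k = 1, \ldots, d$, by $K_k$ the $(\ell \times \ell)$-dimensional matrix of elements $K_k(x_i^k, x_j^k) = \min(x_i^k, x_j^k)$, and by $\Theta(x_j^k) = (\theta(x_1^k - x_j^k), \ldots, \theta(x_\ell^k - x_j^k))^T$, $j = 1, \ldots, \ell$, $k = 1, \ldots, d$, the $d \times \ell$ vectors of dimensionality $\ell$.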

Let vector $x = (x^1, \ldots, x^d)$ have bounded coordinates

$$0 \le x^k \le c^k, \quad k = 1, \ldots, d. \tag{62}$$

Since conditional probability does not exceed 1, we need one more constraint $P(y = 1|c^1, \ldots, c^d) \le 1$. That is, we have to add the constraint

$$\sum_{k=1}^{d} A_k^T X^k + b \le 1, \tag{63}$$

where we have denoted $X^k = (x_1^k, \ldots, x_\ell^k)^T$. A function satisfying the described conditions is monotonic.

In order to estimate a multidimensional monotonic function using the multiplicative kernel, one has to solve a quadratic optimization problem of order $\ell$ subject to $N = \ell d$ inequality constraints.
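The additive form (59) and the constraints (61) and (63) can be illustrated with a short Python sketch. It only evaluates a function of the form (59) and checks whether a given set of coefficients satisfies the constraints; it does not solve the quadratic optimization problem. The function names, the convention $\theta(0) = 1$, and the toy coefficients are assumptions made for the example.

```python
import numpy as np

def additive_ink_function(x, X, A, b):
    """Evaluate (59): f(x) = sum_i sum_k a_i^k * min(x_i^k, x^k) + b.
    X has shape (l, d) (training points), A has shape (l, d) (coefficients a_i^k)."""
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    return float(np.sum(A * np.minimum(X, np.asarray(x, dtype=float))) + b)

def satisfies_constraints(X, A, b):
    """Check the monotonicity constraints (61) and the upper bound (63)."""
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    l, d = X.shape
    for k in range(d):
        # theta[i, j] = theta(x_i^k - x_j^k); column j gives the slope just below x_j^k.
        theta = (X[:, k][:, None] >= X[:, k][None, :]).astype(float)
        if np.any(A[:, k] @ theta < 0):
            return False
        if A[:, k].sum() < 0:  # slope at the origin along coordinate k
            return False
    return bool(np.sum(A * X) + b <= 1.0)

# Hypothetical training points (l = 3, d = 2) and coefficients.
X = np.array([[0.2, 0.1], [0.5, 0.9], [0.8, 0.4]])
A = np.array([[0.3, 0.1], [0.2, 0.2], [0.1, 0.1]])
b = 0.1
print(satisfies_constraints(X, A, b))
print(additive_ink_function([0.5, 0.4], X, A, b))
print(additive_ink_function([0.6, 0.4], X, A, b))
```

Along each coordinate the function (59) is piecewise linear, so checking the slopes at the sample points and at the origin is enough to verify that it is non-decreasing.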

With  additive  kernel,  one  has  to  estimate  dx(£+l)  parameters  under  dx(£+l)  constraints.  To  decrease  the 
computation  amount: 

1.  One  can  replace  V- matrix  with  I- matrix. 

2.  For  additive  kernel,  one  can  estimate  multidimensional  conditional  probability  function  in  the 

t 

restricted  set  of  functions  where  ou=u;-,  for  some  or  for  all  t. 

3.  One can consider linear structure of the solution using $d$ one-dimensional estimates of conditional probability $P_t(y = 1 \mid x^t)$ obtained by solving one-dimensional estimation problems and then approximate the multidimensional conditional probability function as

$$P(y = 1 \mid x^1, \ldots, x^d) = \sum_{t=1}^{d} \beta_t P_t(y = 1 \mid x^t), \tag{64}$$

where the weights $\beta_t \ge 0$, $\sum_t \beta_t = 1$ are computed by solving a $d$-dimensional quadratic optimization problem under $d+1$ constraints. That optimization problem is formulated as follows: minimize the functional

$$B^T P V P^T B - 2 B^T P V Y + \gamma B^T B \tag{65}$$

subject to the constraints

$$B \ge 0, \quad B^T 1 = 1, \tag{66}$$

where we have denoted by $B$ the vector of coefficients $B = (\beta_1, \ldots, \beta_d)^T$ and by $P$ the $(d \times \ell)$-dimensional matrix of elements $p(x_i^t)$, $t = 1, \ldots, d$; $i = 1, \ldots, \ell$.
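As an illustration of the weight-estimation problem in item 3, here is a minimal sketch that minimizes a quadratic functional of this form over non-negative weights summing to one. It is not the solver used in the report: projected gradient descent is a swapped-in technique, the V-matrix defaults to the identity (as item 1 permits), and the names `project_to_simplex` and `synergy_weights` are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def synergy_weights(P, Y, gamma=1e-3, V=None, n_iter=2000):
    """Minimize B^T P V P^T B - 2 B^T P V Y + gamma B^T B over the simplex.

    P: (d, l) matrix of one-dimensional probability estimates,
    Y: (l,) vector of 0/1 labels, V: symmetric (l, l) matrix (identity if None).
    """
    d, ell = P.shape
    V = np.eye(ell) if V is None else V
    Q = P @ V @ P.T + gamma * np.eye(d)      # quadratic part
    c = P @ V @ Y                            # linear part
    # Gradient is 2 (Q B - c); its Lipschitz constant is 2 * lambda_max(Q).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q)[-1])
    B = np.full(d, 1.0 / d)
    for _ in range(n_iter):
        B = project_to_simplex(B - step * 2.0 * (Q @ B - c))
    return B

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=50).astype(float)
P = np.vstack([Y, rng.random(50), rng.random(50)])  # first estimate is perfect
print(synergy_weights(P, Y))  # most of the weight goes to the first estimate
```

For small d, any general-purpose quadratic programming routine would serve equally well; the projection step is used here only to keep the sketch dependency-light.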

We now construct several examples of synergy rules for SVMs where we use the same training set both
for constructing SVM rules $s^t = f_t(x)$, $t = 1, \ldots, d$, and for estimating the conditional probability
$P(y = 1 \mid s^1, \ldots, s^d)$.

Suppose that our rules were constructed using different SVM kernels $K_t(x, x')$ and the same training set

$$(x_1, y_1), \ldots, (x_\ell, y_\ell) \tag{67}$$

and let $s_i^t$, $i = 1, \ldots, \ell$, $t = 1, \ldots, d$, be the scores $s^t = f_t(x)$ obtained using vectors $x_i$.

Note that these scores are statistically different from the scores obtained using $\ell$ elements of a test set
(the scores of support vectors are biased: in the separable case, all $|s_i^t| = 1$). Therefore, it is reasonable to use scores
obtained in the procedure of k-fold cross-validation for estimating parameters of the SVM algorithm.

Also, note that while individual components of the same $d$-dimensional vector $S_i = (s_i^1, \ldots, s_i^d)$ are
interdependent, the vectors $S_i$ themselves are not (they are i.i.d.), so the general theory developed in the
previous sections is applicable here for computing conditional probabilities.

We now consider several examples of synergy of $d$ SVM rules obtained under different circumstances:

1.  Synergy of $d$ rules obtained using the same training data but different kernels.

2.  Synergy of $d$ rules obtained using different training data but the same kernel.

3.  Synergy in a $d$-class classification problem using $d$ one-versus-rest rules.

First, we show that the accuracy of classification using synergy of SVM rules that use different kernels can
be much higher than the accuracy of a rule based on any single kernel. The idea of combining several SVMs into an
ensemble (such as [11]) was used in the past to improve classification performance; however, these
approaches did not leverage the main monotonicity property of SVM. The effect of synergy, measured
by the number of additional training examples required to reach the accuracy level of synergy,
can be significant.


We  selected  the  following  9  calibration  data  sets  from  UCI  Machine  Learning  Repository  [12]:  Covertype, 
Adult,  Tic-tac-toe,  Diabetes,  Australian,  Spambase,  MONK’s-1,  MONK’s-2,  and  Bank  marketing.  Our 
selection  of  these  specific  data  sets  was  driven  by  the  desire  to  ensure  statistical  reliability  of  targeted 
estimates,  which  translated  into  availability  of  relatively  large  test  data  set  (containing  at  least  150  samples). 
Specific  breakdowns  for  the  corresponding  training  and  test  sets  are  listed  in  Table  1.  For  each  of  these  9  data 
sets,  we  constructed  10  random  realizations  of  training  and  test  data  sets;  for  each  of  these  10  realizations,  we 
trained  three  SVMs  with  different  kernels:  with  RBF  kernel,  with  INK-Spline  kernel,  and  with  linear  kernel. 
The  averaged  test  errors  of  the  constructed  SVMs  are  listed  in  Table  2. 


Table 1. Calibration Data Sets from UCI Machine Learning Repository.

Data set       Training    Test     Features
Covertype      300         3000     54
Adult          300         26147    123
Tic-tac-toe    300         658      27
Diabetes       576         192      8
Australian     517         173      14
Spambase       300         4301     57
MONK’s-1       124         432      6
MONK’s-2       169         432      6
Bank           300         4221     16

Constructed  SVMs  provide  binary  classifications  y  and  scores  s.  Additional  performance  improvements 
are  possible  by  intelligent  leveraging  of  the  results  of  these  classifications. 


We  compared  our  approach  with  the  baseline  method  of  voting  on  classification  results  of  all  three 
classifications  obtained  from  three  different  kernels  (since  we  had  odd  number  of  kernels,  we  did  not  need 
any  tie-breaking  in  that  vote).  The  first  column  of  Table  2  shows  the  averaged  test  errors  of  that  voting 
approach. 

The second column of Table 2 shows the averaged test errors of our synergy approach. Specifically, the
data in the second column are based on constructing a 3-dimensional monotonic conditional probability
function from RKHS associated with additive kernel, on triples of SVM scores s. In this column, we assigned
the classification labels y based on the sign of the difference between 3-dimensional conditional probability
and the threshold value 1/2.


The  last  column  of  Table  2  contains  relative  performance  gain  (i.e.,  relative  decrease  of  error  rate) 
delivered  by  the  proposed  synergy  approach  over  the  benchmark  voting  algorithm. 
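The relative gain is simply the relative decrease of the error rate. A small worked sketch follows, using the rounded Adult error rates reported in Table 2 (20.07% for voting, 19.08% for synergy); the function name `relative_gain` is hypothetical. Recomputing gains from rounded table entries can differ from the tabulated values in the last digit for some data sets.

```python
def relative_gain(baseline_error, synergy_error):
    """Relative decrease of the error rate with respect to the baseline."""
    return (baseline_error - synergy_error) / baseline_error

# Adult data set: voting error 20.07%, synergy error 19.08%.
gain = relative_gain(20.07, 19.08)
print(f"{gain:.2%}")  # prints 4.93%
```

A negative value means the baseline was better, as for the data sets where synergy has the higher error rate.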


Table 2. Synergy of SVMs with RBF, INK-Spline, and Linear Kernels.

Data set       Voting     Synergy    Gain
Covertype      27.83%     28.96%     -4.05%
Adult          20.07%     19.08%     4.93%
Tic-tac-toe    1.95%      1.75%      10.16%
Diabetes       24.53%     23.39%     4.67%
Australian     12.02%     12.54%     -4.33%
Spambase       8.96%      8.44%      5.80%
MONK’s-1       22.80%     20.16%     11.57%
MONK’s-2       19.31%     16.23%     15.95%
Bank           12.79%     11.73%     8.29%

The  results  demonstrate  the  consistent  performance  advantage  of  synergy  approach  over  its  empirical 
alternative  in  most  of  the  cases  (for  7  data  sets  out  of  9);  for  some  data  sets  this  advantage  is  relatively  small, 
but  for  others  it  is  substantial  (in  relative  terms). 


This substantial performance improvement of synergy can also be viewed as a viable alternative to brute
force approaches relying on accumulation of (big) data. Indeed, for the already considered Adult data set, we
compared results of our synergy approach on a training data set consisting of 300 samples to an alternative
approach relying on training SVM algorithms on larger training data sets. Specifically, we trained SVMs with
RBF kernel and INK-Spline kernel on Adult data sets containing 1,000 and 3,000 samples. The results, shown
in Table 3, suggest that synergy of two rules, even on a training data set of limited size, can be better than
straightforward SVMs on training data sets of much larger sizes (in this example, equivalent to an increase
of the training sample by more than a factor of 10).


Table 3. Synergy Versus Training Size Increase: Ensemble.

Training size    300       1000      3000
RBF              20.95%    19.21%    18.49%
INK-Spline       19.77%    18.72%    18.38%
Synergy          17.92%    -         -

Suppose now we are dealing with a “big data” situation, where the number $L$ of elements in the training data
set

$$(x_1, y_1), \ldots, (x_L, y_L) \tag{68}$$

is large. Consider the SVM method that uses a universal kernel. A universal kernel (for example, RBF) can
approximate well any bounded continuous function. Generally speaking, with the increase of the size $\ell$ of training
data, the expected error rate of the obtained SVM rule monotonically converges to that of the Bayesian rule (here the
expectation is taken both over the rules obtained from different training data of the same size $\ell$ and over test
data). The typical learning curve shows the dependence of that expected error rate on the size $\ell$ of training
data as a hyperbola-looking curve consisting of two parts: the beginning of the curve, where the error rate
falls steeply with the increase of $\ell$, and the tail of the curve, where the error rate slowly converges to the
Bayesian solution. Suppose that the transition from the “steeply falling” part of the curve to the “slowly
decreasing” part of the curve (sometimes referred to as the “knee” of the curve) occurs for some $\ell^*$. Assuming
that the large number $L$ in (68) is greater than $\ell^*$, we partition the training data into $J$ subsets containing $\ell$ elements
each (here $L = J\ell$ and $\ell > \ell^*$ as well). On each of these $J$ disjoint training subsets we construct its own SVM rule
(independent of other rules). For each of these SVM rules, we construct its own one-dimensional monotonic
conditional probability function $P_t(y = 1 \mid s^t)$, $t = 1, \ldots, J$.


Then, using these $J$ one-dimensional monotonic conditional probability functions, we construct the $J$-dimensional
($s = (s^1, \ldots, s^J)$) conditional probability function as follows:

$$P_{syn}(y = 1 \mid s) = \frac{1}{J} \sum_{t=1}^{J} P_t(y = 1 \mid s^t). \tag{69}$$

The Synergy decision rule in this case has the form

$$y = \theta\left(P_{syn}(y = 1 \mid s) - \frac{1}{2}\right). \tag{70}$$

Note that the above conditional probability function forms an unbiased estimate of the values of the learning curve
describing conditional probability for training data of (different) size $\ell$. Since the training data for different $t$
are independent, the averaging of $J$ conditional probability values decreases the variance of the resulting
conditional probability by a factor of $J$. In this approach, by choosing an appropriate value of $\ell$, one can
optimally solve the bias-variance dilemma.
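The variance argument can be checked numerically. The sketch below is a toy simulation under stated assumptions, not the report's experiment: each of the J estimates is modeled as a true probability plus independent Gaussian noise, labels are assigned by thresholding the averaged probability at 1/2, and all names (`synergy_probability`, `synergy_label`, `simulate`) are hypothetical.

```python
import random
import statistics

def synergy_probability(probs):
    """Average J one-dimensional conditional probability estimates."""
    return sum(probs) / len(probs)

def synergy_label(probs, threshold=0.5):
    """Assign class 1 when the averaged probability reaches the threshold."""
    return 1 if synergy_probability(probs) >= threshold else 0

def simulate(J, trials=20000, true_p=0.7, noise=0.05, seed=0):
    """Compare the variance of one noisy estimate with that of the average of J."""
    rng = random.Random(seed)
    single, averaged = [], []
    for _ in range(trials):
        estimates = [true_p + rng.gauss(0.0, noise) for _ in range(J)]
        single.append(estimates[0])
        averaged.append(synergy_probability(estimates))
    return statistics.variance(single), statistics.variance(averaged)

var_single, var_avg = simulate(J=3)
print(var_single / var_avg)  # close to J = 3 for independent estimates
```

The factor-of-J reduction relies on independence of the J estimates, which is why the subsets in the text are disjoint.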

To illustrate this approach, we again used the Adult data set. Specifically, we trained SVMs with RBF kernel
on Adult data sets containing 900, 1,000 and 3,000 samples. For the first of these samples (containing 900
elements), we also executed the following procedure: we split it into three subsets containing 300 elements
each, trained an RBF SVM on each of them, and then constructed two combined decision rules: (1) voting on
the labels of the three auxiliary SVMs, and (2) synergy of the three SVMs as described in this section. The results,
shown in Table 4, suggest that synergy of rules on disjoint data sets can be better than straightforward SVMs
on training data sets of much larger sizes (in this example, equivalent to an increase of the training sample by a
factor of 3).

Table 4. Synergy Versus Training Size Increase: Bagging.

Training size           300       300       300       900       1000      3000
RBF SVM                 20.77%    19.06%    21.40%    20.01%    19.21%    18.49%
Voting on 3 subsets     N/A       N/A       N/A       19.44%
Synergy on 3 subsets    N/A       N/A       N/A       18.52%

Comparison of Table 3 and Table 4 suggests that synergy of SVMs with different SVM kernels obtained
on the same data set may be more beneficial (equivalent to a ten-fold increase of training sample size) than the
synergy of SVMs with the same kernel obtained on different subsets of that data set (equivalent to a three-fold
increase of training sample size).

Thus it is reasonable to assume that, for big data sets, Synergy of SVM rules obtained on different training
data and Synergy of SVM rules with different kernels can be unified to create an even more accurate synergy
rule. This unification can be implemented in the following manner.


Consider $d$ kernels $K_r(x, x')$, $r = 1, \ldots, d$. For each of these kernels, we
construct the corresponding conditional probability function

$$P_{syn}(y = 1 \mid s(r)) = \frac{1}{J} \sum_{t=1}^{J} P_t(y = 1 \mid s^t(r)), \tag{71}$$

where we have denoted by $P_t(y = 1 \mid s^t(r))$ the conditional probability function estimated for the rule with kernel
$K_r(x, x')$ and for the $t$th subset of training data with the fixed $\ell$. Let us introduce the vector $p = (p_1, \ldots, p_d)$, where

$$p_r = P_{syn}(y = 1 \mid s(r)), \quad r = 1, \ldots, d. \tag{72}$$

Using these vectors, we estimate the corresponding $d$-dimensional conditional probability function

$$P_{syn}(y = 1 \mid p) = P_{syn}(y = 1 \mid p_1, \ldots, p_d). \tag{73}$$

The  resulting  double  reinforced  Synergy  rule  has  the  form 

(74) 

We now consider multi-class classification, an important problem in pattern recognition. In contrast to
methods for constructing two-class classification rules, which have solid statistical justifications, existing
methods for constructing $d > 2$ class classification rules are based on heuristics.


One of the most popular heuristics, one versus rest (OVR), suggests first solving the following $d$ two-class
classification problems: in problem number $k$ (where $k = 1, \ldots, d$), the examples of class $k$ are considered
as examples of the first class and examples of all the other classes $1, \ldots, (k-1), (k+1), \ldots, d$ are considered as the
second class. Using the OVR approach, one constructs $d$ different two-class classification rules

$$y = \theta(f_k(x)), \quad k = 1, \ldots, d. \tag{75}$$

The new object $x^*$ is assigned to the class $k$ whose rule provides the maximum score for $x^*$:

$$k = \operatorname{argmax}\{s^1_*, \ldots, s^d_*\}, \quad \text{where } s^t_* = f_t(x^*). \tag{76}$$

This method of $d$-class classification is not based on a clear statistical foundation. Another common heuristic
is called one versus one (OVO): it suggests solving $C_d^2$ two-class classification problems separating all possible
pairs of classes. To classify a new object $x^*$, one uses a voting scheme based on the obtained $C_d^2$ rules.
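The two heuristics described above can be sketched as follows. This is a minimal illustration with hypothetical names (`ovr_predict`, `ovo_predict`, `pairwise_winner`); classes are numbered from 0 here, and ties in the vote are resolved by whichever class was counted first, which is an arbitrary choice of this sketch.

```python
from collections import Counter
from itertools import combinations
from math import comb

def ovr_predict(scores):
    """OVR: choose the class whose two-class rule gives the largest score."""
    return max(range(len(scores)), key=lambda k: scores[k])

def ovo_predict(pairwise_winner, num_classes):
    """OVO: every pair of classes casts one vote; the most voted class wins.

    pairwise_winner(a, b) is assumed to return a or b for the object at hand.
    """
    votes = Counter(pairwise_winner(a, b)
                    for a, b in combinations(range(num_classes), 2))
    return votes.most_common(1)[0][0]

print(ovr_predict([0.2, 1.5, -0.3]))   # class with the maximum score
print(comb(4, 2))                      # number of pairwise rules for d = 4
```

The number of pairwise rules grows quadratically with the number of classes, whereas OVR needs only d rules.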


Here we implement the following multi-class classification procedure. For every $k$ (where $k = 1, \ldots, d$), we
solve the corresponding OVR SVM problem, for which all the elements with the original label $k$ are marked
with $y = 1$, while all the other elements are marked with $y = 0$. Upon solving all these $d$ problems, we can, for
any given vector $x$ and any class $k$, compute its score $s^k(x)$ provided by the $k$th SVM rule. After that, we merge
the scores of these auxiliary SVM rules following the approach described in our paper [13].

We compared our synergy approach with the standard OVR approach for the data sets Vehicle, Waveform,
and Cardiotocography from UCI Machine Learning Repository [12]. Training and test sets were selected
randomly from these data sets; the number of elements in each is shown in Table 5; the table also shows the
error rates achieved by the OVR and synergy algorithms, along with the relative performance gain obtained with our
approach. The results confirm the viability of our framework.
Table 5. Synergy for Multi-Class Classification.

Data set            Classes    Features    Training    Test    OVR       Synergy    Gain
Vehicle             4          18          709         236     17.45%    14.15%     18.91%
Waveform            3          40          200         4800    20.10%    18.31%     8.90%
Cardiotocography    3          21          300         1826    15.83%    12.05%     23.87%

Thus,  we  showed  that: 


1.  Scores $s = (s^1, \ldots, s^d)$ of several monotonic classifiers (for example, SVMs) that solve the same pattern
recognition problem can be transformed into multi-dimensional monotonic conditional probability functions
$P(y \mid s)$ (probability of class $y$ given scores $s$).

2.  There  exists  an  effective  algorithm  for  such  transformation. 

3.  Classification  rules  obtained  on  the  basis  of  constructed  conditional  probability  functions  significantly 
improve  performance,  especially  in  multi-class  classification  cases. 


3.2  Data-Driven  Approach 


In this section, we focus on the standard general binary classification problem setting. In this problem, there
is a training set consisting of L samples, each being an N-dimensional vector that belongs to one of two classes,
negative and positive, traditionally labeled as -1 and +1. For this setting, a decision rule has to be learned on
the given set of L vectors, which can then be applied for classification of any arbitrary N-dimensional vector
into one of two classes, -1 or +1, with minimum possible error rate (in other words, with minimum possible
probability of misclassification, i.e., assignment of the wrong class label to the vector). SVM is the current
best-in-class algorithm for solving this type of problem.

We  explored  the  applicability  of  the  current  feature  construction  approach  to  a  special  area  of  discriminative 
learning  (Figure  3)  -  Learning  Using  Privileged  Information  (LUPI),  which  essentially  relies  on  two  distinct 
classes  of  features  (standard  and  privileged).  The  developed  approach  (in  our  papers  [14]  [15]  [16]  [17]  [18]) 
of  leveraging  derived  features  was  successfully  carried  over  to  the  area  of  Learning  using  Privileged 
Information  (LUPI)  by  using  the  regression  mechanism  of  generating  derived  features  (the  one  we  originally 
proposed  using  for  SVM)  for  approximating  the  privileged  features  in  LUPI  using  standard  ones. 
[Figure 3 diagram. Labels: standard features, privileged features, positive class, negative class, learning decision rule on training set, applying decision rule on test set, knowledge transfer. Annotations: the decision rule is learned on both standard (low resolution) and privileged (high resolution) features; privileged (high resolution) features are partially learned from standard (low resolution) features; the decision rule works on standard (low resolution) features and on those privileged (high resolution) features that are learned from standard (low resolution) features.]

Figure 3. Novel Framework for Learning Using Privileged Information.


By using this approach, we were able to capture some of the information contained in the privileged
features and, by expressing it through standard features in the form of regression, we opened a new possibility
of solving LUPI problems not just by controlling similarity between vectors (as was done before), but by
information transfer from the space of privileged features to the space of standard ones. Even more
importantly, this information transfer method essentially converts LUPI into a special form of SVM, thus
completely resolving the main computational issue of scalability that was the key disadvantage of LUPI up
until now. Indeed, current LUPI is only capable of handling about 300 training examples in a reasonable time,
whereas, with the new DILEGENT approach, we were able to process a sample containing 2000 vectors
(representing pixel data for image classification) without any computational problems.

In another example, using a set of pre-processed video snapshots of a terrain, one has to separate pictures
with specific targets in them (class +1) from pictures where there are no such targets (class -1). The original
videos were made using aerial cameras of different resolutions: a low resolution camera with wide view
(capable of covering large areas quickly) and a high resolution camera with narrow view (covering smaller areas
and thus unsuitable for fast coverage of terrain). The goal was to make judgements about presence or absence
of targets using the wide view camera that could quickly span large surface areas. The narrow view camera could
be used during the training phase for zooming in on the areas where target presence was suspected, but it was not to
be used during actual operation of the monitoring system, i.e., during the test phase. Thus, the wide view camera
with low resolution corresponds to standard information, whereas the narrow view camera with high
resolution corresponds to privileged information.

Modern data analysis problems require construction of decision rules that operate in high dimensional spaces.
Thus, in order to obtain good decision rules, one has to train learning algorithms using a huge number of
examples (tens or hundreds of thousands). According to the statistical learning theory [19] [20], which
provides precise estimates of the convergence of learning algorithms, even the best state-of-the-art learning
algorithms such as Support Vector Machines (SVM) will require plenty of training examples and computation
time. At the same time, we know that humans can learn well from a small number of training examples. This
gap in learning performance between humans and machines has been a persistent challenge from the very
beginning of the computer era.

Learning  Using  Privileged  Information  (LUPI)  was  introduced  first  as  an  idea  in  [20]  and  subsequently 
realized  in  several  versions  of  SVM+  algorithms,  generalizing  SVM  with  different  degrees  of  implementation 
complexity  [21]  [22].  The  idea  of  LUPI  is  to  improve  classification  accuracy  with  small  training  datasets  by 
fusing  the  training  data  with  additional  features,  which  are  not  present  in  the  testing  data.  These  features  can 
come  from  an  entirely  different  sensor  modality.  An  example  would  be  the  utilization  of  medical  records 
accompanying  X-ray  images  of  existing  patients  to  better  classify  X-ray  images  of  new  patients  who  have  not 
yet  been  diagnosed. 

In  the  wide  area  aerial  video  exploitation  scenario,  which  is  the  application  considered  by  this  contribution, 
the  additional  features  come  from  high  resolution  videos  or  still  imagery  coincident  with  the  training  data. 
These  additional  data  help  reduce  the  number  of  data  samples  required  to  train  accurate  models  of  class 
distributions,  which  is  important  in  dynamic  environments. 

As a development of LUPI, the recent paper [9] introduced the concept of intelligent learning, based on the ideas
of knowledge transfer, which allows further improvement of classifier performance. Improved scalability is
achieved  by  casting  the  problem  within  the  SVM  framework,  for  which  multiple  efficient  algorithms  are 
already  implemented  on  most  platforms;  the  improvement  in  flexibility  is  achieved  by  the  ability  to  utilize  a 
variety  of  methods  within  the  knowledge  transfer  framework. 


Next,  we  formulate,  based  on  the  general  framework  we  developed  in  our  papers  [14]  [15]  [16]  [17]  [18],  two 
specific  types  of  knowledge  transfer  algorithms:  privileged  feature  regression  and  privileged  clustering.  We 
further  consider  a  combination  of  several  knowledge  transfer  models  in  ensemble  type  learning.  We  compare 
the  performance,  measured  in  misclassification  error  rate  and  execution  time,  of  both  types  of  LUPI 
algorithms  with  the  original  SVM+  algorithm.  We  apply  the  algorithms  to  a  wide  area  aerial  video  exploitation 
problem,  where  the  privileged  information  represents  more  expensive  and/or  higher  quality  sensor 
information  available  for  training  data,  but  not  for  testing  data.  Using  the  Minor  Area  Motion  Imagery 
(MAMI)  dataset  recently  collected  by  Air  Force  Research  Laboratory  (Freeman,  2014),  we  demonstrate  that 
knowledge  transfer  approach  to  LUPI  provides  consistently  better  performance  than  the  original  SVM+ 
approach.  Using  an  ensemble  of  several  knowledge  transfer  algorithms,  the  error  rate  is  reduced  by  up  to 
25%.  We  also  demonstrate  a  significant  computational  speedup,  making  LUPI  algorithm  as  scalable  as  the 
standard  SVM. 

The  traditional  machine  learning  paradigm  is  formulated  as  follows.  Given  a  set  of  training  examples,  and  a 
parameterized  collection  of  decision  functions,  find  the  function  that  approximates  the  unknown  decision  rule 
in  the  best  possible  way  [9]  [20].  Formally,  for  a  binary  classification  problem,  we  are  given  L  training  vectors 

$$X_1, X_2, \ldots, X_L \in R^N$$

and the corresponding labels $y_1, y_2, \ldots, y_L \in \{-1, +1\}$.


In the SVM approach, some kernel $K(X_i, X_j)$ is selected in the space $R^N$. Then, for positive penalty parameters
$C_1, C_2, \ldots, C_L$ (usually equal to the same number $C$ normalized by the ratio of positive or negative labels in the
training sample, respectively), a quadratic optimization problem is solved in order to find the values of
parameters $\alpha_1, \alpha_2, \ldots, \alpha_L$ that maximize the following functional

$$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j K(X_i, X_j) \tag{77}$$

subject to constraints

$$\sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C_i, \quad i = 1, \ldots, L. \tag{78}$$


The parameters $\alpha_1, \alpha_2, \ldots, \alpha_L$ are used to construct the decision function

$$f(z) = \mathrm{sgn}\left(\sum_i \alpha_i y_i K(X_i, z) + B\right). \tag{79}$$

The offset $B$ is computed for some training index $j$ such that $0 < \alpha_j < C_j$:

$$B = y_j - \sum_i \alpha_i y_i K(X_i, X_j). \tag{80}$$

The parameters $C_1, C_2, \ldots, C_L$ and the kernel parameters of the SVM algorithm are usually optimized by
executing cross-validation over some grid in the parameter space.
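To make the roles of the dual parameters and the offset concrete, the following minimal sketch evaluates the standard kernel SVM decision function, the sign of the weighted kernel sum plus the offset B, on a toy problem whose dual solution can be found by hand. It assumes a linear kernel and two one-dimensional training points; the function names are hypothetical. With X_1 = -1 (label -1) and X_2 = +1 (label +1), the equality constraint forces both multipliers to a common value a, the dual objective becomes 2a - 2a^2, and its maximum is at a = 0.5.

```python
def linear_kernel(u, v):
    """Inner product of two vectors given as tuples."""
    return sum(a * b for a, b in zip(u, v))

def offset(alphas, labels, vectors, j, kernel=linear_kernel):
    """B = y_j - sum_i alpha_i y_i K(X_i, X_j) for an index j with 0 < alpha_j < C_j."""
    return labels[j] - sum(
        a * y * kernel(x, vectors[j]) for a, y, x in zip(alphas, labels, vectors)
    )

def decide(z, alphas, labels, vectors, b, kernel=linear_kernel):
    """Sign of sum_i alpha_i y_i K(X_i, z) + B (zero is mapped to +1 here)."""
    score = sum(a * y * kernel(x, z) for a, y, x in zip(alphas, labels, vectors)) + b
    return 1 if score >= 0 else -1

vectors = [(-1.0,), (1.0,)]
labels = [-1, 1]
alphas = [0.5, 0.5]          # hand-derived maximizer of the toy dual problem
b = offset(alphas, labels, vectors, j=1)
print(b, decide((2.0,), alphas, labels, vectors, b), decide((-0.3,), alphas, labels, vectors, b))
```

In practice the multipliers come from a quadratic programming solver; only the evaluation step is shown here.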


The  LUPI  paradigm  can  be  described  as  learning  with  a  teacher.  In  the  classical  model  of  learning  described 
above,  the  teacher  supplies  the  set  of  labels,  and  his/her  role  is  trivial.  In  the  LUPI  model,  the  teacher  supplies 
students  with  non-trivial  additional  information  in  various  forms,  such  as  images,  explanations,  metaphors, 
etc.  This  additional  information  is  present  only  during  the  training  stage  (when  the  teacher  is  available),  and 
will  not  be  present  during  the  test  stage  (when  the  teacher  is  not  available). 


Formally, the LUPI approach is described as follows: at the training stage, we have L standard vectors

$$X_1, X_2, \ldots, X_L \in R^N \tag{81}$$

and, corresponding to them, L privileged vectors

$$x_1, x_2, \ldots, x_L \in R^M \tag{82}$$

and L labels

$$y_1, y_2, \ldots, y_L \in \{-1, +1\}. \tag{83}$$

The decision rule has to operate only on the standard N-dimensional space $R^N$, as the test vectors belong to
that space, and no privileged information will be available.


The original algorithm implementing the LUPI paradigm, SVM+, was designed as a generalization of the
SVM algorithm [21] [22]. In SVM+, two kernels $K(X_i, X_j)$ and $k(x_i, x_j)$ are selected respectively in the standard
space $R^N$ and the privileged space $R^M$. Then, for fixed positive structural parameters $\kappa$ and $\gamma$ and positive
penalty parameters $C_1, C_2, \ldots, C_L$, SVM+ solves the quadratic programming problem of finding the parameters
$\alpha_1, \alpha_2, \ldots, \alpha_L$ and $\delta_1, \delta_2, \ldots, \delta_L$ that maximize the functional

$$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j K(X_i, X_j) - \frac{1}{2\gamma} \sum_{i,j} y_i y_j (\alpha_i - \delta_i)(\alpha_j - \delta_j)\, k(x_i, x_j) \tag{84}$$

subject to constraints

$$\sum_i \delta_i y_i = 0, \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le \kappa C_i, \quad 0 \le \delta_i \le C_i, \quad i = 1, \ldots, L. \tag{85}$$

There are other versions of SVM+ with similar performance characteristics. The version used in this report is
based on [17]. The parameters are used to construct the decision function

$$f(z) = \mathrm{sgn}\left(\sum_i \alpha_i y_i K(X_i, z) + B\right) \tag{86}$$

where the offset $B$ is computed for some index $j$ such that $0 < \alpha_j < C_j$ and $0 < \delta_j < C_j$.

B=yj(i (87) 

\  ij  ij  / 

The parameters $C_1, C_2, \ldots, C_L$, $\kappa$, $\gamma$ and the kernel parameters of the SVM+ algorithm are optimized by grid search
over the parameter space, similarly to the SVM algorithm.

With the growth of the number of training examples, the LUPI approach converges to the optimal solution
much faster than the classical approaches. As a result, LUPI can require significantly fewer training examples
than classical approaches to achieve the same level of performance. In some cases, LUPI can achieve the
same performance by training with $\sqrt{L}$ examples where the classical approach uses $L$ examples, requiring,
for example, only 320 examples instead of the 100,000 examples required by the classical learning algorithms.
The LUPI approach has already been successfully applied to a diverse set of problems from multiple disciplines
[23] [24] [25].

While  delivering  performance  results  that  were  previously  impossible  to  achieve  within  the  standard  machine 
learning  paradigm,  the  key  problem  with  existing  LUPI  implementations,  such  as  SVM+  [21]  [22],  was  their 
limited  scalability:  since  the  core  matrix  in  the  quadratic  programming  implementation  of  SVM+  is  poorly 
conditioned  and  SVM+  requires  more  parameters  to  tune,  the  practical  limit  of  training  sample  size  was  about 
200  examples  (for  larger  samples,  algorithms  required  days  and  weeks  to  converge).  Although  specially 
designed  spline  kernels  [10]  allowed  increasing  that  sample  size  to  300-350  examples,  the  scalability  problem 
remained  the  main  obstacle  for  much  wider  applications  of  LUPI. 

To  reiterate,  there  are  several  factors  that  make  SVM+  more  computationally  expensive: 

i.  The  quadratic  optimization  problem  associated  with  SVM+  is  twice  the  size  of  the  one  associated  with 
SVM  and  is  more  complex,  and  so  it  takes  more  time  to  solve; 

ii.  SVM+  has  four  free  parameters  -  twice  the  number  of  free  parameters  in  the  SVM  algorithm.  Tuning 
four  parameters  requires  more  computations. 

iii.  The  core  matrix  in  the  quadratic  optimization  problem  is  usually  ill-conditioned,  which  significantly 
slows  down  the  quadratic  optimization  process. 
All these factors, taken together, define the practical limit of training sample size for the SVM+ algorithm at
about 200-300 examples.

Recently, the concept of LUPI was theoretically expanded to include more general knowledge transfer
mechanisms [14] [15] [16] that transfer knowledge from the space of privileged information (the space of the
Teacher's explanations) to the space where the decision rule is constructed.

We  now  formulate  a  specific  algorithm  [17]  implementing  knowledge  transfer  via  regression  of  privileged 
features.  Here,  we  consider  two  versions  of  the  privileged  regression  approach:  one  that  uses  linear  ridge 
regression  for  approximating  privileged  features,  and  another  one  that  uses  kernel  regression  with  RBF 
functions.  In  both  versions,  regularization  parameters  (and  for  non-linear  regression,  the  Gaussian  parameter 
of  RBF  function)  are  selected  using  2-fold  cross  validation. 

Consider a training set consisting of $L$ standard vectors $X_1, X_2, \ldots, X_L \in R^N$ and the corresponding $L$ privileged vectors $x_1, x_2, \ldots, x_L \in R^M$. We assume that each of the vectors $X_i$ is labeled with $y_i \in \{-1, +1\}$. In matrix form (where the subscripts denote vector indices and superscripts denote vector elements), the training set has the form

\[
\begin{pmatrix}
y_1 & X_1^1 & \cdots & X_1^N & x_1^1 & \cdots & x_1^M \\
y_2 & X_2^1 & \cdots & X_2^N & x_2^1 & \cdots & x_2^M \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
y_L & X_L^1 & \cdots & X_L^N & x_L^1 & \cdots & x_L^M
\end{pmatrix}
\tag{88}
\]

We  now  describe  the  way  knowledge  transfer  LUPI  algorithm  works  for  this  training  set  if  the  parameters  of 
the  desired  SVM  classification  decision  rule  are  already  known;  if  they  are  not  known,  the  parameter  search 
is  executed  as  described  further. 


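Train-1. For each $j = 1, 2, \ldots, M$, do the following. Using the $N$-dimensional vectors $X_1, X_2, \ldots, X_L$ as explanatory variables and the corresponding scalar values $x_1^j, x_2^j, \ldots, x_L^j$ as response variables, construct a (linear or nonlinear) regression function $\varphi_j$ so that

\[
\begin{aligned}
\varphi_j(X_1^1, X_1^2, \ldots, X_1^N) &= z_1^j \approx x_1^j, \\
\varphi_j(X_2^1, X_2^2, \ldots, X_2^N) &= z_2^j \approx x_2^j, \\
&\;\;\vdots \\
\varphi_j(X_L^1, X_L^2, \ldots, X_L^N) &= z_L^j \approx x_L^j.
\end{aligned}
\tag{89}
\]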


Train-2. Use the values $z_i^j$ (where $j = 1, 2, \ldots, M$ and $i = 1, 2, \ldots, L$) constructed in the previous step to augment the vectors $X_1, X_2, \ldots, X_L$ from the $N$-dimensional space $R^N$ to form vectors $Z_1, Z_2, \ldots, Z_L$ from the $(N+M)$-dimensional space $R^{N+M}$; these vectors have the form

\[
\begin{aligned}
Z_1 &= (X_1^1 \; X_1^2 \; \cdots \; X_1^N \; z_1^1 \; z_1^2 \; \cdots \; z_1^M), \\
Z_2 &= (X_2^1 \; X_2^2 \; \cdots \; X_2^N \; z_2^1 \; z_2^2 \; \cdots \; z_2^M), \\
&\;\;\vdots \\
Z_L &= (X_L^1 \; X_L^2 \; \cdots \; X_L^N \; z_L^1 \; z_L^2 \; \cdots \; z_L^M).
\end{aligned}
\tag{90}
\]

As a result, for each $i = 1, 2, \ldots, L$, the first $N$ elements of vector $Z_i$ constitute the standard vector $X_i$, while the last $M$ elements of vector $Z_i$ constitute the regression-based approximations $(z_i^1, z_i^2, \ldots, z_i^M)$ to the privileged vector $x_i$.

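Train-3. Create the new (augmented) training set with vectors $Z_1, Z_2, \ldots, Z_L$ and corresponding labels $y_1, y_2, \ldots, y_L$; in matrix form, the augmented training set has the form

\[
\begin{pmatrix}
y_1 & X_1^1 & \cdots & X_1^N & z_1^1 & z_1^2 & \cdots & z_1^M \\
y_2 & X_2^1 & \cdots & X_2^N & z_2^1 & z_2^2 & \cdots & z_2^M \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
y_L & X_L^1 & \cdots & X_L^N & z_L^1 & z_L^2 & \cdots & z_L^M
\end{pmatrix}
\tag{91}
\]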

Train-4. Train an SVM algorithm with known parameters on the vectors $Z_1, Z_2, \ldots, Z_L$ from the $(N+M)$-dimensional space $R^{N+M}$ and construct the corresponding classification decision function $F$, which, when applied to any $(N+M)$-dimensional vector $Z$, produces the classification output $Y = F(Z)$, where $Y \in \{-1, +1\}$. Otherwise, if the parameters are unknown, carry out the SVM parameter search as described further in subsection C.

The designed classification decision function $F$ can now be applied to any standard vector $X$ from the $N$-dimensional space $R^N$ in the following manner.

Test-1. Using the $M$ regressions $\varphi_1, \ldots, \varphi_M$ already constructed during training, compute $M$ scalar values

\[ z^1 = \varphi_1(X), \; z^2 = \varphi_2(X), \; \ldots, \; z^M = \varphi_M(X). \tag{92} \]

Test-2. Construct the $(N+M)$-dimensional vector $Z$ by concatenating these $M$ scalar values with the $N$-dimensional vector $X$:

\[ Z = (X^1 \; X^2 \; \cdots \; X^N \; z^1 \; z^2 \; \cdots \; z^M). \tag{93} \]

Test-3. Apply the classification decision function $F$ to the constructed $(N+M)$-dimensional vector $Z$ and obtain the classification label $Y = F(Z)$, where $Y \in \{-1, +1\}$. This label $Y$ is the desired classification of the standard $N$-dimensional vector $X$.

In other words, the designed $(N+M)$-dimensional decision rule is applied to any $N$-dimensional test vector $X$ in three steps: (1) using the $M$ multivariate regressions already constructed at the training stage, compute $M$ approximations to the missing privileged features, (2) concatenate the $N$-dimensional test vector $X$ with these $M$ approximations, and (3) apply the decision rule to the resulting $(N+M)$-dimensional augmented test vector. The full feature space thus combines the original features and the new features derived from domain knowledge (similar to [1] [26] [27]), where the domain knowledge is represented by the privileged feature space.
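
The training and test steps above can be sketched in code. The following is a minimal illustration, assuming the linear ridge regression variant with a closed-form solution for step Train-1. The function names `fit_privileged_ridge` and `augment` and the regularization value `lam` are hypothetical and are not taken from [17]. Step Train-4 is omitted: any standard SVM implementation could be trained on the augmented vectors.

```python
import numpy as np

def fit_privileged_ridge(X, X_star, lam=1.0):
    """Train-1 (sketch): fit M ridge regressions from standard to privileged features.

    X has shape (L, N), X_star has shape (L, M). Returns (W, b) such that
    X @ W + b approximates X_star. lam is an assumed regularization constant.
    """
    X = np.asarray(X, dtype=float)
    X_star = np.asarray(X_star, dtype=float)
    x_mean = X.mean(axis=0)
    s_mean = X_star.mean(axis=0)
    Xc = X - x_mean
    Sc = X_star - s_mean
    # Closed-form ridge solution, solved for all M response columns at once.
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Sc)
    b = s_mean - x_mean @ W
    return W, b

def augment(X, W, b):
    """Train-2 / Test-1 / Test-2 (sketch): append z^1..z^M to each standard vector."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X @ W + b])
```

The same `augment` call would be used for training vectors and for test vectors, since only standard features are needed once the regressions are fitted.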

As already mentioned, the training procedure described above is applicable if the parameters of the desired SVM classification decision rule are already known (for instance, for SVM with RBF kernel these are the SVM penalty parameter $C$ and the Gaussian parameter $\gamma$). If they are not known (which is usually the case), they are selected using grid search over a pre-defined set of parameter vectors $q$. In the case of SVM with RBF kernel, these vectors are 2-dimensional. In general, we assume that this set consists of $P$ vectors $\{q_1, q_2, \ldots, q_P\}$. The search is performed in the following way (for simplicity, we describe it in the case of 6-fold cross-validation).

Param-1. The training set $X$ is randomly partitioned into six subsets $X_1, X_2, \ldots, X_6$ of approximately equal size.

Param-2. For each $i = 1, 2, \ldots, P$, the cross-validation error rate $E_i$ of the algorithm with parameter vector $q_i$ is computed in the following way.

Param-2-1. For each $k = 1, 2, \ldots, 6$, the following operations are executed.

Param-2-1-1. Auxiliary sets $X_{train}$ and $X_{test}$ are formed: $X_{test}$ is the $k$-th of the sets $X_1, X_2, \ldots, X_6$, and $X_{train}$ is the union of the other five subsets.

Param-2-1-2. Apply steps Train-1, Train-2, Train-3, Train-4 to $X_{train}$ to compute the classification decision function $F_{ik}$.

Param-2-1-3. Apply the classification decision function $F_{ik}$ to $X_{test}$ according to Test-1, Test-2, Test-3 and compute the resulting classification error rate $E_{ik}$.

Param-2-2. Compute the average $E_i$ of the six error rates $E_{i1}, E_{i2}, \ldots, E_{i6}$.

Param-3. Among all $i = 1, 2, \ldots, P$, select the parameter vector $q_i$ corresponding to the smallest error rate $E_i$.
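
The parameter search in steps Param-1 to Param-3 can be sketched as follows. This is an illustrative outline only: the callback `train_fn` is a hypothetical stand-in for steps Train-1 to Train-4 (it must return a rule that plays the role of Test-1 to Test-3), and the function names and the `seed` argument are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, n_folds=6, seed=0):
    """Param-1: randomly partition indices into n_folds subsets of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

def grid_search(samples, labels, param_grid, train_fn, n_folds=6, seed=0):
    """Param-2 and Param-3: return (best parameter vector, its mean CV error rate).

    train_fn(train_samples, train_labels, params) is a hypothetical callback
    that returns a callable rule(sample) -> label.
    """
    if len(samples) < n_folds:
        raise ValueError("need at least one sample per fold")
    folds = k_fold_indices(len(samples), n_folds, seed)
    best_params, best_error = None, float("inf")
    for params in param_grid:
        fold_errors = []
        for k in range(n_folds):
            test_idx = folds[k]
            # X_train is the union of the other folds.
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            rule = train_fn([samples[i] for i in train_idx],
                            [labels[i] for i in train_idx], params)
            mistakes = sum(rule(samples[i]) != labels[i] for i in test_idx)
            fold_errors.append(mistakes / len(test_idx))
        mean_error = sum(fold_errors) / n_folds
        if mean_error < best_error:
            best_params, best_error = params, mean_error
    return best_params, best_error
```

Ties between parameter vectors are resolved here in favor of the first one encountered; the source does not specify a tie-breaking rule.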

In  terms  of  scalability,  it  is  clear  that  both  linear  and  nonlinear  versions  of  LUPI  knowledge  transfer 
algorithms  avoid  the  main  problem  of  SVM+:  while  the  additional  step  of  calculating  M  multivariate 
regressions  takes  some  time,  the  regression  is  performed  only  once  during  the  whole  grid  search,  and,  most 
importantly,  the  augmented  training  data  are  then  processed  by  standard  scalable  SVM  implementations. 

In terms of performance, it is important to gauge the result properly. Assuming that the quality of the designed LUPI solution is $A$ (i.e., its error rate is $A$%), we can compare it with the quality of two standard learning solutions:

(1) Only standard features are used, i.e., SVM in $N$-dimensional space.

(2) Both standard and privileged features are used, i.e., SVM in $(N+M)$-dimensional space.

Solution (1) corresponds to the pre-LUPI situation, when privileged features may exist but are discarded since there is no mechanism to take them into account. Solution (2) is the ideal situation, when the privileged features do not disappear during the test phase but instead remain available as standard features. Assuming that solution (1) has quality $B$ (i.e., its error rate is $B$%), while solution (2) has quality $C$ (i.e., its error rate is $C$%), we can generally expect that $C < A < B$. Indeed, $C$ should be the smallest of the three, since all privileged features are actually standard, so the power of the SVM solution can be used to its fullest. Also, $B$ should be the largest of the three, since it corresponds to the pre-LUPI situation of classical learning, where all privileged features are ignored. So the quality of our LUPI solution, being “sandwiched” between $C$ and $B$, can be naturally evaluated by measuring how much progress LUPI makes within the performance gap $B - C$. Thus, the metric $(B-A)/(B-C)$ can be used to assess the relative quality of the LUPI solution.

Note that if the gap $B - C$ is small, the privileged information is not particularly relevant (knowing it in solution (2) does not improve much on the quality of solution (1)), and applying LUPI is probably hopeless, since there is little room for improvement. In our experience, it is probably safe to start looking for a LUPI solution if the gap $B - C$ is at least 1.5-2 times larger than $C$. An improvement metric $(B-A)/(B-C)$ of about 20-30% or more would then constitute success of the LUPI approach.
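
As a small worked illustration of this metric, the sketch below computes $(B-A)/(B-C)$ and the rule-of-thumb check on the gap. The function names and the default factor of 1.5 (the lower end of the 1.5-2 range quoted above) are choices made for the example, and any error rates passed in are hypothetical inputs rather than results from this report.

```python
def relative_lupi_gain(error_lupi, error_standard, error_ideal):
    """Return (B - A) / (B - C): the share of the gap B - C closed by LUPI.

    error_lupi is A, error_standard is B (standard features only),
    error_ideal is C (privileged features available at test time).
    """
    gap = error_standard - error_ideal
    if gap <= 0:
        raise ValueError("expected B > C; otherwise there is no gap to close")
    return (error_standard - error_lupi) / gap

def gap_is_promising(error_standard, error_ideal, factor=1.5):
    """Rule of thumb from the text: the gap B - C is at least `factor` times C."""
    return (error_standard - error_ideal) >= factor * error_ideal
```

For instance, with hypothetical error rates B = 10, C = 2 and A = 8, the gap is 8 and the LUPI solution closes a quarter of it.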

The  MAMI  classification  dataset  [28]  is  based  on  mover  extraction  from  airborne  wide  area  imagery  collected 
by  AFRL.  The  goal  of  the  classifier  is  to  improve  dismount  detection  in  low-resolution  motion  imagery.  As 
described,  for  instance,  in  [29],  dismount  tracking  is  the  concept  of  tracking  a  person  either  by  direct 
observation  or  indirectly  by  inference,  such  as  determining  where  the  person  was  when  exiting  direct  view. 
Dismount  tracking  is  an  important  security  application  in  that  a  nominated  person  can  be  tracked  through 
various  activities  to  predict  and  mitigate  harmful  actions,  establish  intent,  and  determine  social  group 
association [29]. In its native resolution, the MAMI imagery has dismounts of 20-pixel height, while imagery downsampled to 1/8th resolution has 2-3 pixel high dismounts. Motion imagery is generated at about 15 Hz. The data were collected by multiple cameras with different resolutions.

We  used  test  data  from  June  26,  2013.  These  images  are  centered  on  a  grassy  area  where  a  picnic  is  taking 
place.  There  are  dozens  of  people  milling,  walking,  running,  sitting,  and  playing  volleyball  and  other  games. 
There  are  also  some  scripted  dismount  activities  and  parked  and  moving  vehicles.  Most  dismounts  are  taken 
from  the  picnic  area  which  was  in  Camera  4's  Field  of  View  (FOV),  see  Figure  4.  Some  dismounts  are  also 
in  the  parking  lots,  and  near  roads  and  paths.  Most  false  alarms  are  from  outside  the  picnic  area,  especially  in 
the  Camera  3  FOV,  which  points  below  Camera  4.  False  alarms  were  caused  by  parallax,  trees,  poles, 
buildings,  vehicles,  reflections,  image  noise,  and  registration  errors. 

To test the LUPI algorithms, we used the images at native and reduced resolutions. Low-resolution 1/N-th images were formed by averaging the intensities of N x N blocks of pixels into one pixel and then interpolating the reduced image back to the original resolution. We grouped images into sequences, each representing 10-15 seconds of data. For each sequence, we designated a target area, where we expected to find multiple dismounts, and a false alarm area, where we expected to see few or no dismounts. We registered the images in a sequence and extracted tracklets.
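
The formation of the low-resolution images can be illustrated with a short sketch. It assumes images stored as lists of rows whose dimensions are divisible by N, and it uses nearest-neighbor replication for the step back to the original resolution; the source does not state which interpolation method was used, so that choice and the function names are assumptions.

```python
def block_average(image, n):
    """Average the intensities of each n x n block of pixels into one pixel."""
    rows, cols = len(image), len(image[0])
    if rows % n or cols % n:
        raise ValueError("image dimensions must be divisible by n")
    return [[sum(image[r + dr][c + dc] for dr in range(n) for dc in range(n)) / (n * n)
             for c in range(0, cols, n)]
            for r in range(0, rows, n)]

def upsample_nearest(image, n):
    """Return the image to the original size by repeating each pixel n x n times.

    Nearest-neighbor replication is an assumed stand-in for the unspecified
    interpolation used in the report.
    """
    return [[value for value in row for _ in range(n)]
            for row in image for _ in range(n)]

def reduce_resolution(image, n):
    """Form a 1/n-th resolution image at the original pixel dimensions."""
    return upsample_nearest(block_average(image, n), n)
```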

We built a background model and used differences between the images and the background model to detect mover candidates. Detections from the frames of each sequence were linked together to form tracks. We tuned the tracker at each resolution to have 70-90% Probability of Detection (PD) on the true dismount samples. We described the detected mover candidates by a set of features and used LUPI techniques to separate the true movers from false alarms.


Figure  4.  Dismounts  Detected  in  MAMI  Data. 


We  built  the  following  features  for  each  detected  mover:  45  object-level  features  (including  blob,  kinematic, 
track,  fraction  of  frames  in  which  this  mover  was  detected,  mean  of  length  measurements  for  this  mover, 
divided  by  standard  deviation,  eccentricity  of  best  fit  ellipse,  standard  deviation  of  orientation  measurements 
over  all  frames,  median  of  the  mean  pixel  values  for  each  observation  of  this  mover,  etc.),  20  activity-level 
features  (turns,  accelerations,  starts,  stops,  etc.),  64  SURF  features,  and  4  Gradient  features,  for  a  total  of  113 
features. 

We selected 14 of these features (such as area and eccentricity) as reasonably robust and used them as the standard features in the designed classifier, while all features at the full resolution formed the privileged space. We constructed the dataset consisting of 856 objects described by 127 features: 14 standard and 113 privileged. Of these objects, extracted from the MAMI data, 567 were labeled as the positive class (dismounts) and 289 as the negative class (false alarms). In each experiment, we used 30 randomly sampled sets of training size L = 80, 160, 320, 640, from which we formed balanced training sets with an equal number of objects Nt = 27, 54, 108, 216 of each of the 2 classes to construct a classifier. For each training set, we used K random partitions into 80% training and 20% tuning subsets to optimize the parameters of each of the algorithms. The remaining objects that were not selected into the training set formed the holdout set, so we had 30 pairs of training and holdout sets.

We optimized the baseline SVM over the grid $C = 0.1, 1, 10$, $\gamma = 2^{-6}, 2^{-4}, 2^{-2}, 1, 2^{2}, 2^{4}, 2^{6}$, using $K = 10$; SVM regression over the grid $C = 0.1, 1, 10$, $\gamma = 0.1, 1, 10$; LPC Linear over 1 to 5 clusters separately for positive and negative classes and $C = 0.1, 1, 10$. LPC Gauss used fixed parameters: 3 and 2 clusters respectively for positive and negative classes, $C = 1$, and

\[ \gamma = \left( \frac{2}{N(N-1)} \sum_{i<j} \| X_i - X_j \|^2 \right)^{1/2}, \]

where $X_i$, $X_j$ are objects from the training set. Parameters for each training set were optimized for the case of Nt = 27, and the same values were used for larger training samples.

We  tested  performance  as  weighted  average  error  on  the  30  holdout  sets.  The  obtained  results  suggest  the 
following  conclusions: 

i.  LUPI-regression  algorithm  consistently  outperforms  the  baseline; 

ii.  LPC  algorithm  on  average  outperforms  the  baseline; 

iii. The scalability problem of LUPI SVM+ was confirmed again for samples larger than 200-300 elements, as we can see from the entry 23391 for L=640 in Table I, which corresponds to 16 days of optimization.

These observations demonstrate that the general knowledge transfer approach in the LUPI paradigm can be implemented in scalable algorithms that successfully leverage privileged information by various means. The results also demonstrate the value of fusing multiple mechanisms of knowledge transfer within the LUPI paradigm. Thus we have formulated and implemented two new closely related classification algorithms within the recently introduced approach of knowledge transfer in LUPI. Both algorithms successfully resolve the scalability problem of previous LUPI approaches and allow for diverse and scalable leveraging of privileged information in classification problems. We verified the proposed approaches on the dismount detection problem in AFRL MAMI airborne data and demonstrated the efficacy of the proposed algorithms, both in terms of their performance and scalability.

In  order  to  explore  privileged  information  and  knowledge  transfer  in  more  detail,  we  considered  the  following 
approach. 

Let  us  suppose  that  Intelligent  Teacher  has  some  knowledge  about  the  solution  of  a  specific  pattern 
recognition  problem  and  would  like  to  transfer  this  knowledge  to  Student.  For  example,  Teacher  can  reliably 
recognize  cancer  in  biopsy  images  (in  a  pixel  space  X)  and  would  like  to  transfer  this  skill  to  Student. 

Formally, this means that Teacher has some function $y = f_0(x)$ that distinguishes cancer ($f_0(x) = +1$ for cancer and $f_0(x) = -1$ for non-cancer) in the pixel space $X$. Unfortunately, Teacher does not know this function explicitly (it only exists as a neural net in Teacher’s brain), so how can Teacher transfer this construction to Student? Below, we describe a possible mechanism for solving this problem; we call this mechanism knowledge transfer.

Suppose that Teacher believes in some theoretical model on which the knowledge of Teacher is based. For the cancer model, he or she believes that cancer is a result of uncontrolled multiplication of the cancer cells (cells of type B), which replace normal cells (cells of type A). Looking at a biopsy image, Teacher tries to generate privileged information that reflects his or her belief in the development of such a process; Teacher can describe the image as:

Aggressive  proliferation  of  cells  of  type  B  into  cells  of  type  A. 

If  there  are  no  signs  of  cancer  activity,  Teacher  may  use  the  description 
Absence  of  any  dynamics  in  the  standard  picture. 

In  uncertain  cases,  Teacher  may  write 

There  exist  small  clusters  of  abnormal  cells  of  unclear  origin. 

In other words, Teacher uses a specialized language that is appropriate for the description $x_i^*$ of cancer development employing the model he or she believes in. Using this language, Teacher supplies Student with privileged information $x_i^*$ for the image $x_i$ by generating training triplets

\[ (x_1, x_1^*, y_1), \ldots, (x_\ell, x_\ell^*, y_\ell). \tag{94} \]

The first two elements of these triplets are descriptions of an image in two languages: in language $X$ (vectors $x_i$ in pixel space), and in language $X^*$ (vectors $x_i^*$ in the space of privileged information), developed for Teacher’s understanding of the cancer model.

Note that the language of pixel space is universal (it can be used for description of many different visual objects; for example, in the pixel space, one can distinguish between male and female faces), while the language used for describing privileged information is very specialized: it reflects just a model of cancer development. This has an important consequence: the set of admissible functions in the general space $X$ has to be rich (has large VC dimension), while the set of admissible functions in the specialized space $X^*$ may be not rich (has small VC dimension).

One can consider two related pattern recognition problems using triplets:

1. The problem of constructing a rule $y = f(x)$ for classification of biopsy in the pixel space $X$ using data

\[ (x_1, y_1), \ldots, (x_\ell, y_\ell). \tag{95} \]

2. The problem of constructing a rule $y = f^*(x^*)$ for classification of biopsy in the space $X^*$ using data

\[ (x_1^*, y_1), \ldots, (x_\ell^*, y_\ell). \tag{96} \]

Suppose that language $X^*$ is so good that it allows one to create a rule $y = f_\ell^*(x^*)$ that classifies vectors $x^*$ corresponding to vectors $x$ with higher accuracy.


Since the VC dimension of the admissible rules in the specialized space $X^*$ is much smaller than the VC dimension of the admissible rules in the universal space $X$, and since the number of examples $\ell$ is the same in both cases, the bounds on error rate for the rule $y = f_\ell^*(x^*)$ in $X^*$ will be better than those for the rule $y = f_\ell(x)$ in $X$ (according to VC theory, the guaranteed bound on accuracy of the chosen rule depends only on two factors: the frequency of errors on the training set and the VC dimension of the admissible set of functions). That is, generally speaking, the classification rule $y = f_\ell^*(x^*)$ will be more accurate than the classification rule $y = f_\ell(x)$.

As a result, the following question of “knowledge transfer” arises: how can one use the knowledge of the rule $y = f_\ell^*(x^*)$ in space $X^*$ to improve the accuracy of the desired rule $y = f_\ell(x)$ in space $X$? We now address this question when both described problems are solved with neural networks.

Consider  three  elements  of  knowledge  representation  used  in  Artificial  Intelligence: 

1.  Fundamental  elements  of  knowledge. 

2.  Frames  (fragments)  of  the  knowledge. 

3.  Structural  connections  of  the  frames  (fragments)  in  the  knowledge. 

We call the fundamental elements of the knowledge a limited number of elements (functions) in $X^*$ that can approximate well the classification rule $y = f_\ell^*(x^*)$; knowledge transfer is then about approximation of those fundamental elements. We now illustrate this concept for SVMs and neural networks.


In order to describe methods of knowledge transfer for SVM, consider the following three-level structure:

1. Level $I^X$: the input vectors $x = (x^1, \ldots, x^n) \in X$.

2. Level $I^Z$: the result of transformation of the vectors $x$ into vectors $z = (K(x_1, x), \ldots, K(x_\ell, x)) \in Z$, where $K$ is the kernel function for SVM.

3. Level $I^Y$ (footnote 1): the linear threshold indicator function $y = \theta(a^T z(x) - b)$ in space $Z$.

Thus the structures of SVM rules in spaces $X$ and $X^*$ can be described as

\[ X: (I^X \to I^Z \to I^Y) \quad \text{and} \quad X^*: (I^{X^*} \to I^{Z^*} \to I^{Y}). \tag{97} \]

To transfer the knowledge about the rule

\[ y = f^*(x^*, \alpha_\ell^*) - b^* = \sum_{i=1}^{\ell} \alpha_i^* K^*(x_i^*, x^*) - b^* \tag{98} \]

in space $X^*$ to the rule

\[ y = f(x, \alpha_\ell) - b = \sum_{i=1}^{\ell} \alpha_i K(x_i, x) - b \tag{99} \]

obtained in space $X$, one can use several strategies. Below we consider three of them.

[Footnote 1] A neural network with one hidden layer has the same structure; like SVM, it is a universal learning machine.

1. A-mapping of privileged information: $X^* \to X$. In this scheme, the goal is to transfer information that exists in level $I^{X^*}$ of SVM in space $X^*$ to level $I^X$ of SVM in space $X$. In order to do this, one maps vectors $x \in X$ into vectors $\bar{x} \in \bar{X}$ by transforming space $X$, obtaining vectors $\bar{x} = Ax$, and then constructs SVM in the transformed space $\bar{X}$. Scheme (A) of information transfer can thus be described as

\[ (A): \; (I^{X^*} \to I^{X}) \to I^{Z} \to I^{Y}. \tag{100} \]

In this scheme, in order to find the transformation $Ax$ of vectors $x = (x^1, \ldots, x^n)^T \in X$ into vectors $Ax = (\varphi_1(x), \ldots, \varphi_m(x))^T$ that minimizes the functional

\[ R(A) = \min_A \int |x^* - Ax|^2 \, p(x^*, x) \, dx^* dx, \tag{101} \]

we look for the minimum

\[ R(\varphi) = \sum_{k=1}^{m} \min_{\varphi_k} \int (x^{*k} - \varphi_k(x))^2 \, p(x^{*k}, x) \, dx^{*k} dx, \tag{102} \]

where $p(x^{*k} | x)$ is the marginal conditional probability of coordinate $x^{*k}$ given vector $x$, and the $m$ functions $\varphi_k(x)$ are defined by $m$ regressions

\[ \varphi_k(x) = \int x^{*k} \, p(x^{*k} | x) \, dx^{*k}, \quad k = 1, \ldots, m. \tag{103} \]

We construct approximations to the functions $\varphi_k(x)$ by solving $m$ regression estimation problems based on data

\[ (x_1^{*k}, x_1), \ldots, (x_\ell^{*k}, x_\ell), \quad k = 1, \ldots, m. \tag{104} \]

In order to find these approximations, we use the Structural Risk Minimization principle [19] in the set of functions that belong to the Reproducing Kernel Hilbert Space (RKHS) associated with some kernel, that is, we minimize the regularized functional

\[ R(\varphi_k) = \min \sum_{i=1}^{\ell} (x_i^{*k} - \varphi_k(x_i))^2 + \gamma \langle \varphi_k(x), \varphi_k(x) \rangle, \quad k = 1, \ldots, m. \tag{105} \]

The obtained approximations to the regressions $\varphi_k(x)$ define our transformation. In this scheme, we first transform the input space, $\bar{X} = AX$, and then train SVM in the transformed space.

2. B-mapping of privileged information: $Z^* \to X$. In this scheme, the goal is to transfer information that exists in level $I^{Z^*}$ of SVM in space $X^*$ to level $I^X$ of SVM in space $X$. In order to do this, one maps vectors $x \in X$ to vectors $z^* \in Z^*$ by transforming space $X$ and obtaining vectors $\bar{x} = Bx \in \bar{X}$, and then constructs SVM in the transformed input space. Scheme (B) of information transfer can thus be described as

(106)

or, in its simplified form, as

(107)

The transformation of input space in this scheme is based on solving the following $t^*$ regression estimation problems ($t^*$ is the dimension of vector

\[ z^* = (K^*(x_1^*, x^*), \ldots, K^*(x_{t^*}^*, x^*))^T, \tag{108} \]

i.e., the number of support vectors in the SVM solution for space $X^*$): given data

\[ (K^*(x_k^*, x_1^*), x_1), \ldots, (K^*(x_k^*, x_\ell^*), x_\ell), \tag{109} \]

find the regression functions

\[ \varphi_k(x) = \int K^*(x_k^*, x^*) \, p(x^* | x) \, dx^*, \quad k = 1, \ldots, t^*. \tag{110} \]

As already described above for A-mapping, one can find such approximations in the RKHS associated with some kernel function. The obtained approximations $\varphi_1(x), \ldots, \varphi_{t^*}(x)$ define our transformation: in the general scheme, we construct an SVM rule in the transformed space; in the simplified scheme, we construct a linear SVM rule in the transformed space $\bar{X}$.

3. C-mapping of privileged information: $Z^* \to Z$. In this scheme, the goal is to transfer information that exists in the level $I^{Z^*}$ of SVM in space $X^*$ to the level $I^Z$ of SVM in space $X$. In order to do this, one maps $t$-dimensional vectors $z \in Z$ ($t$ is the number of support vectors of the SVM rule obtained in space $X$) into $t^*$-dimensional vectors $z^* \in Z^*$ ($t^*$ is the number of support vectors of the SVM rule obtained in space $X^*$), constructing vectors of the form $\bar{z} = Cz \in \bar{Z}$. Every coordinate $k$ in $Z$ space defines the similarity $K(x_k, x)$ between support vector $x_k$ and vector $x \in X$, while every coordinate $k$ in $Z^*$ space defines the similarity $K^*(x_k^*, x^*)$ between support vector $x_k^*$ and vector $x^* \in X^*$, where $x$ and $x^*$ are connected through $p(x^* | x)$. Scheme (C) of information transfer can be described as

\[ (C): \; (I^{Z^*} \to I^{Z}) \to I^{Y}. \tag{111} \]

Our goal is to approximate the similarity functions $K^*(x_k^*, x^*)$, $k = 1, \ldots, t^*$, between support vector $x_k^*$ of the SVM solution in space $X^*$ and vector $x^* \in X^*$ using the $t$ similarity functions $K(x_1, x), \ldots, K(x_t, x)$ defined by the SVM solution in space $X$ for the pairs $(x, x^*)$ generated by $p(x^* | x)$. Let $x_1, \ldots, x_t$ be the support vectors of the SVM solution in space $X$ and let $x_1^*, \ldots, x_{t^*}^*$ be the support vectors of the SVM solution in space $X^*$, where $t$ and $t^*$ are the numbers of support vectors in the SVM solutions obtained in spaces $X$ and $X^*$, respectively. The SVM rule in space $X$ has the form

\[ f(x, \alpha) = \sum_{i=1}^{t} \alpha_i K(x_i, x) + b, \tag{112} \]

and the SVM rule in space $X^*$ has the form

\[ f^*(x^*, \alpha^*) = \sum_{i=1}^{t^*} \alpha_i^* K^*(x_i^*, x^*) + b^*. \tag{113} \]

In order to achieve our goal, we approximate the functions $K^*(x_k^*, x^*)$, $k = 1, \ldots, t^*$, with the regression functions

\[ \varphi_k(x) = \int K^*(x_k^*, x^*) \, p(x^* | x) \, dx^*, \quad k = 1, \ldots, t^*. \tag{114} \]

For each $k = 1, \ldots, t^*$, we construct the approximation to $\varphi_k(x)$ by using the data

\[ (K^*(x_k^*, x_1^*), z_1), \ldots, (K^*(x_k^*, x_\ell^*), z_\ell), \quad k = 1, \ldots, t^*, \tag{115} \]

where $z_i = (K(x_1, x_i), \ldots, K(x_t, x_i)) \in Z$ is a $t$-dimensional vector. Let space $\bar{Z} = (\varphi_1(x), \ldots, \varphi_{t^*}(x))$ be the result of transformation of space $Z$. After that, we construct a linear SVM in space $\bar{Z}$.
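
All three mappings rely on regression estimation in an RKHS by minimizing a regularized functional of the form (105). For a squared loss, the minimizer has the closed form $\varphi(x) = \sum_i c_i K(x_i, x)$ with $c = (K + \gamma I)^{-1} y$, which the sketch below implements. It assumes a Gaussian kernel; the names `rbf_kernel`, `fit_rkhs_regressions`, `apply_regressions`, `reg` (the constant $\gamma$ of (105)) and `width` (the Gaussian parameter) are hypothetical and chosen for the example.

```python
import numpy as np

def rbf_kernel(A, B, width=1.0):
    """Gaussian kernel matrix with entries exp(-width * ||a - b||^2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-width * np.maximum(sq_dists, 0.0))

def fit_rkhs_regressions(X, targets, reg=1e-3, width=1.0):
    """Fit one regularized RKHS regression per target column.

    X has shape (l, n). targets has shape (l, m): privileged coordinates for
    A-mapping, or kernel similarities K*(x*_k, x*_i) for B- and C-mapping.
    """
    X = np.asarray(X, dtype=float)
    targets = np.asarray(targets, dtype=float)
    K = rbf_kernel(X, X, width)
    # Closed-form minimizer of squared loss plus reg * RKHS norm.
    coeffs = np.linalg.solve(K + reg * np.eye(len(X)), targets)
    return X, coeffs, width

def apply_regressions(model, X_new):
    """Evaluate the fitted regressions phi_1..phi_m on new vectors."""
    X_train, coeffs, width = model
    return rbf_kernel(X_new, X_train, width) @ coeffs
```

For C-mapping, the inputs passed as `X` would be the $t$-dimensional vectors $z_i$ rather than the original vectors $x_i$; the regression step itself is unchanged.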

One  can  construct  many  different  schemes  of  knowledge  transformation  from  space  X*  to  space  X  (as  well 
as  schemes  of  combining  knowledge  existing  in  both  spaces)  based  on  described  approaches. 

In particular, in all three described mappings A-C, one may also concatenate the constructed knowledge-transferred features with those already available from space $X$ and solve SVM on this augmented set; the constructed knowledge-transferred features could be subject to feature selection in order to improve the classification performance; for C-mapping, one could construct regression functions only for those functions $K^*(x_i^*, x^*)$ that correspond to “significant” weights $\alpha_i^*$; if linear regression functions are used for C-mapping, their positive versions could be explored as more relevant, etc. Note that C-mapping requires executing two versions of SVM: one for the standard space, and one for the privileged one.

Knowledge transfer in Neural Networks is analogous to knowledge transfer in SVMs. As in the case of SVM described above, one constructs and trains two neural networks: one network in space $X$ and another network in space $X^*$. To simplify the notation, we assume that both networks have the same architecture containing $s$ layers. Let the input vector $x \in X$ define the first layer $I^Z(0)$ of the neural network in space $X$; this vector is transformed into vector $z_1 \in Z(1)$ in the next layer of the trained network, and layers $I^Z(k)$, $k = 2, \ldots, s-1$, provide subsequent transformations $z_k \in Z(k)$. As in SVM, the last layer is the linear indicator function $y = \theta(\cdot)$ with weights $a_s$ (or its sigmoid approximation). The structure of the Neural Network in space $X$ is

\[ X: \; I^{Z}(0) \to I^{Z}(1) \to \cdots \to I^{Y}, \tag{116} \]

and the structure of the Neural Network in space $X^*$ is

\[ X^*: \; I^{Z^*}(0) \to I^{Z^*}(1) \to \cdots \to I^{Y}. \tag{117} \]

The simple scheme of knowledge transfer from the network in space $X^*$ to the network in space $X$ can be described as follows: information accumulated in the first $k$ layers of the network trained in space $X^*$ is transferred into the $m$-th layer of the network in space $X$:

\[ (I^{Z^*}(0) \to \cdots \to I^{Z^*}(k)) \to (I^{Z}(0) \to \cdots \to I^{Z}(m)) \to I(m+1) \to \cdots \to I^{Y}, \tag{118} \]

which forms the operator $z^*(k) = A z(m)$ that transforms vectors $z(m)$ from the neural network in space $X$ into vectors $z^*(k)$ of the neural network in space $X^*$.

The new neural network contains three parts:

1.  The first part of the network contains the first $m$ layers of the trained network in space $X$; we denote
it $N(0,m)$. This network performs the transformation $z(m) = N(0,m)\,x(0)$.

2.  The second part of the network contains operator $A$ that transforms vectors $z(m)$ into vectors
$z^*(k) = A z(m)$.

3.  The third part of the network is the part of the network in space $X^*$ starting from level $(k+1)$, the
free parameters of which have to be learned; we denote it $N^*(k,s)$. Vectors $z^*(k)$ are the input of this
part of the network, and classifications are the output.


The scheme of such a combined network is

$$N(0,m) \to A \to N^*(k,s), \qquad (119)$$

where $N(0,m)$ is fixed (does not have free parameters), while $N^*(k,s)$ contains free parameters. Therefore
operator $A$ transfers knowledge about the neural network in $X^*$.

In order to find this operator based on two trained networks, one uses the same techniques of regression
estimation as in the case of SVM. Let $z^*(k) = (z^{*1}(k), \ldots, z^{*s^*}(k))$ be vectors produced on the
level $I^*_Z(k)$ by the network trained in space $X^*$, and let $z(m) = (z^1(m), \ldots, z^s(m))$ be vectors
produced on the level $I_Z(m)$ by the network trained in space $X$.


Consider pairs

$$(x_1, x^*_1), \ldots, (x_\ell, x^*_\ell) \qquad (120)$$

from the training triplets. Let

$$z_1(m), \ldots, z_\ell(m) \qquad (121)$$

be vectors produced by the $m$-th layer of the neural network corresponding to vectors $x$, and let

$$z^*_1(k), \ldots, z^*_\ell(k) \qquad (122)$$

be vectors

$$z^*_i(k) = (z^{*1}_i(k), \ldots, z^{*s^*}_i(k)) \qquad (123)$$

produced by the $k$-th layer of the neural network corresponding to vectors $x^*$. In order to construct mapping
operator $A$ as in the SVM case, we estimate $s^*$ regression functions $z^{*t}(k) = \varphi_t(z(m))$ using data

$$(z^{*t}_1(k), z_1(m)), \ldots, (z^{*t}_\ell(k), z_\ell(m)), \quad t = 1, \ldots, s^*. \qquad (124)$$

Therefore operator $A$ transforms vectors $z(m)$ into vectors

$$A z(m) = (\varphi_1(z(m)), \ldots, \varphi_{s^*}(z(m))). \qquad (125)$$

For a neural network that contains more than one hidden layer, one can transfer knowledge from the network in
$X^*$ using more than one operator $A_i$, $i = 1, \ldots, p$, by sequentially constructing several
transformations between different layers of the network in $X^*$.
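The component-wise construction of operator A can be illustrated with a minimal sketch. It assumes that each regression is linear and fitted by least squares (the text does not fix the class of regression functions), and it adds a tiny ridge term purely for numerical stability. All function names and the toy data are hypothetical and are not part of the distributed software.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * w = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear(inputs, targets, ridge=1e-8):
    """Least-squares fit of target ~ w . [z, 1]; the tiny ridge term is an
    assumption added only to keep the normal equations well conditioned."""
    rows = [list(z) + [1.0] for z in inputs]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    moment = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(gram, moment)


def fit_operator_a(z_m, z_star_k):
    """Fit one regression per component of z*(k), one per index t."""
    n_components = len(z_star_k[0])
    return [fit_linear(z_m, [row[t] for row in z_star_k])
            for t in range(n_components)]


def apply_operator_a(weights, z):
    """Map a layer-m vector z(m) to the estimated layer-k vector A z(m)."""
    extended = list(z) + [1.0]
    return [sum(w * v for w, v in zip(ws, extended)) for ws in weights]


# Toy data: five layer-m vectors and the matching layer-k vectors of the
# privileged network (an exact affine function, purely for illustration).
z_m = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
z_star_k = [[2 * a - b + 1, a + 3 * b] for a, b in z_m]
operator_a = fit_operator_a(z_m, z_star_k)
print(apply_operator_a(operator_a, [3.0, 2.0]))  # close to [5.0, 9.0]
```

In this sketch the fitted operator stays fixed afterwards, and only the remaining layers of the combined network would be trained on the vectors it produces.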


Thus  we  described  three  key  approaches  for  mapping  of  privileged  information  for  knowledge  transfer.  In 
this  section,  we  present  scalable  algorithms  for  one  of  them,  namely  A-mapping,  based  on  multivariate 
regressions  of  privileged  features  as  functions  of  decision  variables;  we  also  illustrate  the  algorithms’ 
performance  and  their  properties  on  several  examples. 

In order to illustrate this version of knowledge transfer LUPI, we explored a synthetic dataset derived
from the “Parkinsons” dataset in the UCI Machine Learning Repository. Since none of the 22 features of the
“Parkinsons” dataset is privileged, we created several artificial scenarios emulating the presence of privileged
information in that dataset. Specifically, we ordered the “Parkinsons” features according to the values of their
mutual information (with the first features having the lowest mutual information and the last features having
the largest). Then, for several values of parameter k, we treated the last k features as privileged ones and the
first 22-k features as decision ones. Since our ordering was based on mutual information, these
experiments corresponded to privileged spaces of various dimensions and various relevance levels for
classification. For each considered value of k, we generated 20 pairs of training and test subsets, containing,
respectively, 75% and 25% of the elements of the “Parkinsons” dataset. For each of these pairs, we considered
the following four types of classification scenarios for both SVM (with RBF kernel) and ANN algorithms:

1.  SVM and ANN on 22-k decision features;

2.  Knowledge transfer LUPI (linear) based on constructing k multiple linear regressions from 22-k
decision features to each of k privileged ones, replacing the corresponding values in privileged vectors
with their regressed approximations, and then training SVM and ANN on the augmented dataset
consisting of 22 features;

3.  Knowledge transfer LUPI (non-linear) based on constructing k non-linear (in the class of RBF
functions) regressions from 22-k decision features to each of k privileged ones, replacing the
corresponding values in privileged vectors with their regressed approximations, and then training SVM
and ANN on the augmented dataset consisting of 22 features;

4.  SVM and ANN on all 22 features.
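The way the decision and privileged feature sets are formed in these scenarios can be sketched as follows. The sketch assumes that a relevance score per feature (for instance, a mutual information estimate) has already been computed by some other means; the function names and the example scores are hypothetical and do not come from the report.

```python
def split_by_relevance(relevance, k):
    """Return (decision, privileged) feature indices.

    Features are ordered by increasing relevance score; the last k of them
    (the most relevant ones) are treated as privileged.
    """
    if not 0 < k < len(relevance):
        raise ValueError("k must satisfy 0 < k < number of features")
    order = sorted(range(len(relevance)), key=lambda j: relevance[j])
    return order[:-k], order[-k:]


def select_columns(rows, columns):
    """Keep only the given feature columns of every data row."""
    return [[row[j] for j in columns] for row in rows]


# Hypothetical relevance scores for four features (not taken from the report).
scores = [0.30, 0.05, 0.20, 0.10]
decision, privileged = split_by_relevance(scores, k=2)
print(decision, privileged)  # [1, 3] [2, 0]
```

Scenario 1 would then train on `select_columns(data, decision)` only, while scenarios 2 and 3 would additionally regress each privileged column on the decision columns.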


For each scenario, the algorithms were trained in the following way:

SVM. Two parameters for RBF kernels, namely SVM penalty parameter C and RBF kernel parameter γ, were
selected using 6-fold cross-validation error rate over the two-dimensional grid of both parameters C and γ. In
that grid, log2(C) ranged from -5 to +5 with step 0.5, and log2(γ) ranged from -6 to +6 with step 0.5 (thus the
whole grid consisted of 21x25=525 pairs of tested parameters C and γ).

ANN. Neural networks were trained using Mathworks™ Matlab Neural Network Toolbox™ with the same
default parameters, such as using the hyperbolic tangent sigmoid as activation function, applying the
Levenberg-Marquardt backpropagation training algorithm, and selecting the ratio for training:validation:test
as 70:15:15 for early stopping on cross-entropy, etc. For each input dimensionality, the architecture of the ANN
was selected with several hidden layers (from one to five), with the number of neurons in each ranging from 5
to 100 (a separate ANN was trained for each of these architecture choices; the final architecture was then
selected based on the best performance). Note that we do not claim that these particular architecture choices
for SVM and ANN are optimal; our point is to demonstrate the significant potential of LUPI improvement with
different classification methods, whether these methods are optimal or not.
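The SVM parameter selection described above can be sketched as a plain grid search with 6-fold cross-validation. This is only an outline under stated assumptions: the folds are contiguous and unshuffled, and the training and evaluation of the SVM itself is delegated to a caller-supplied function `fold_error`, which is a hypothetical placeholder rather than an existing API.

```python
def exponent_range(start, stop, step):
    """Exponents start, start + step, ..., stop (inclusive)."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def k_fold_indices(n_samples, n_folds=6):
    """Split range(n_samples) into n_folds (train, validation) index pairs."""
    folds = []
    for f in range(n_folds):
        lo = f * n_samples // n_folds
        hi = (f + 1) * n_samples // n_folds
        validation = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, n_samples))
        folds.append((train, validation))
    return folds


def grid_search(n_samples, fold_error, n_folds=6):
    """Return the (C, gamma) pair with the lowest mean validation error.

    fold_error(c, gamma, train_idx, val_idx) is a caller-supplied function
    that trains on train_idx and returns the error rate on val_idx.
    """
    folds = k_fold_indices(n_samples, n_folds)
    grid = [(2.0 ** c, 2.0 ** g)
            for c in exponent_range(-5, 5, 0.5)    # 21 values of log2(C)
            for g in exponent_range(-6, 6, 0.5)]   # 25 values of log2(gamma)

    def mean_error(pair):
        c, gamma = pair
        return sum(fold_error(c, gamma, tr, va) for tr, va in folds) / len(folds)

    return min(grid, key=mean_error), len(grid)
```

The second returned value is the grid size, which for the ranges above equals 21 x 25 = 525.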

The averaged (over 20 realizations) error rates for these scenarios are shown in Table 6 (for SVM) and in
Table 7 (for ANN). The collected results show that the performance of SVM (and its LUPI modifications) is
better than that of ANN (and its LUPI modifications). They also show that both linear and nonlinear versions
of Knowledge Transfer LUPI improve the performance of SVM and ANN on decision inputs (often
significantly, in relative terms) in all of the considered scenarios. Note that both versions are just examples of
knowledge transfer, and other mappings (especially if relevant domain knowledge is available) could be
leveraged.

Table 6. Performance of SVM and LUPI on Modified “Parkinsons” Example.

k    SVM on decision   LUPI       LUPI          SVM on all   LUPI gain   LUPI gain
     features          (linear)   (nonlinear)   features     (linear)    (nonlinear)

12   13.26%            9.18%      11.32%        6.63%        61.55%      29.25%
11   13.52%            10.66%     12.70%        6.63%        41.49%      11.85%
10   13.16%            10.00%     12.19%        6.63%        48.45%      14.85%
9    12.70%            8.67%      10.76%        6.63%        66.41%      31.95%
8    12.81%            8.52%      10.76%        6.63%        69.44%      33.07%
7    14.49%            11.07%     13.16%        6.63%        43.51%      16.88%
6    13.78%            11.17%     12.35%        6.63%        36.43%      20.01%
5    10.56%            8.98%      9.49%         6.63%        40.27%      27.28%
4    11.22%            10.36%     10.10%        6.63%        18.88%      24.42%
3    12.04%            9.59%      9.44%         6.63%        45.28%      48.13%
2    8.47%             7.55%      7.04%         6.63%        49.99%      77.76%

Table 7. Performance of ANN and LUPI on Modified “Parkinsons” Example.

k    ANN on decision   LUPI       LUPI          ANN on all   LUPI gain   LUPI gain
     features          (linear)   (nonlinear)   features     (linear)    (nonlinear)

12   19.49%            16.43%     15.46%        8.01%        26.66%      35.11%
11   19.44%            15.20%     15.56%        8.01%        37.05%      33.93%
10   21.33%            14.64%     15.66%        8.01%        50.18%      42.52%
9    20.66%            12.70%     13.72%        8.01%        62.90%      54.83%
8    20.26%            12.04%     13.98%        8.01%        67.08%      51.25%
7    18.57%            13.01%     15.05%        8.01%        52.65%      33.33%
6    20.20%            13.83%     13.93%        8.01%        52.29%      51.45%
5    16.84%            11.63%     11.27%        8.01%        58.96%      63.00%
4    17.35%            12.45%     12.50%        8.01%        52.46%      51.91%
3    12.14%            11.48%     11.48%        8.01%        16.05%      16.05%
2    10.97%            10.25%     10.25%        8.01%        24.13%      24.15%

Numerically, the error rates of LUPI lie between those of the corresponding SVM or ANN constructed on
decision features and on all features. In other words, if the error rate of the algorithm on decision features is B,
while the error rate of the algorithm on all features is C, the error rate A of LUPI satisfies the bounds C<A<B.
So one can evaluate the efficiency of the LUPI approach by computing the metric (B-A)/(B-C), which describes
how much of the performance gap B-C can be recovered by LUPI. For SVM, this metric varies between 12%
and 78%; for ANN, this metric varies between 16% and 67%. Generally, in realistic examples, the typical
value for this LUPI efficiency metric is in the ballpark of 35%. Also note that if the gap B-C is small compared
to C, the privileged information is not particularly relevant; in that case, applying LUPI is unlikely to help,
since there is little room for improvement. It is probably safe to start looking for a LUPI solution if the gap
B-C is at least 1.5-2 times larger than C.
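As a small worked example, the efficiency metric (B-A)/(B-C) can be computed directly from the tabulated error rates. The function name below is hypothetical; the inputs are the rounded percentages from Table 6, so the result agrees with the tabulated gain only up to rounding.

```python
def lupi_efficiency(error_decision, error_lupi, error_all):
    """Fraction (B - A) / (B - C) of the performance gap recovered by LUPI.

    error_decision is B, error_lupi is A, error_all is C.
    """
    gap = error_decision - error_all
    if gap <= 0:
        raise ValueError("error on decision features must exceed error on all features")
    return (error_decision - error_lupi) / gap


# Row k = 12 of Table 6 (SVM, linear knowledge transfer), errors in percent.
# Prints 61.5; Table 6 reports 61.55%, presumably from unrounded error rates.
print(round(100 * lupi_efficiency(13.26, 9.18, 6.63), 1))
```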


We have also implemented C-mapping for SVM for the already described datasets using the same setting
as for A-mapping, with the following modification: instead of constructing regressions to privileged features,
we constructed (positive linear or nonlinear kernel) regressions to functions $K^*(x^*_i, x^*)$ with subsequent
selection of the top 40 of them, in terms of their relevance to the label, as determined by the RandomForest
method.

The averaged (over 20 realizations) error rates for these scenarios are shown in Table 8. The collected
results show that both linear and nonlinear versions of Knowledge Transfer LUPI with C-mapping improve
the performance of SVM on decision inputs (often significantly, in relative terms) in all of the considered
scenarios.


Table 8. Performance of SVM and LUPI on Modified “Parkinsons” Example.

k    SVM on decision   LUPI       LUPI          SVM on all   LUPI gain   LUPI gain
     features          (linear)   (nonlinear)   features     (linear)    (nonlinear)

12   13.26%            9.59%      9.59%         6.63%        53.35%      55.35%
11   13.52%            9.79%      10.10%        6.63%        54.14%      49.64%
10   13.16%            11.22%     9.79%         6.63%        29.71%      51.61%
9    12.70%            10.30%     10.51%        6.63%        39.54%      36.08%
8    12.81%            10.20%     10.20%        6.63%        42.23%      42.23%
7    14.49%            9.49%      11.53%        6.63%        63.61%      37.66%
6    13.78%            10.71%     11.94%        6.63%        42.94%      25.73%
5    10.56%            9.18%      10.20%        6.63%        35.11%      9.16%
4    11.22%            8.26%      10.30%        6.63%        64.49%      20.04%
3    12.04%            9.08%      10.41%        6.63%        54.71%      30.13%
2    8.47%             7.57%      8.18%         6.63%        48.91%      15.76%

In  this  paper,  we  described  several  properties  of  privileged  information  including  its  role  in  machine 
learning,  its  structure,  and  its  applications.  We  extended  the  existing  knowledge  transfer  research  in  the  area 
of  privileged  information  (initially  considered  for  SVM)  to  neural  networks  and  presented  a  scalable 
algorithmic  framework,  which  has  the  same  scalability  properties  as  current  implementations.  The  described 
framework  is  the  first  step  in  the  proposed  direction,  and  its  further  improvements  (especially  concerning 
alternative  methods  of  knowledge  transfer)  will  be  the  subject  of  future  work. 

3.3  Software  for  LUPI 

The  current  distribution  software  for  knowledge  transfer  LUPI  consists  of  the  files 

1)  priv_predict.py 

2)  std_predict.py 

3)  lupi_predict.py 

4)  SVMstd.py 

5)  SVMpriv.py 

6)  SVMlupi.py 

7)  test_error.py 

8)  partition.py 

9)  experiment.py 

10)  mamiStd.py 

11)  mamiLupi.py

and the folders

1)  data

2)  models

In order to run this software, Anaconda has to be installed, which can be done via
https://docs.continuum.io/anaconda/install. Once Anaconda is installed, the scripts can be used as follows.
First, train and test datasets can be created using the partition.py script:

python  partition.py 

which will create train and test datasets for the features X and the labels y. Options of the script can be used
to choose the features and label files, the test size, and other parameters. If no options are passed to the script,
default options are used and the resulting files are saved in ./data/partitions/. Standard SVM is trained by
typing the command

python  SVMstd.py 

This  command  will  save  several  files  in  the  ./models/  directory;  these  files  are  the  models  that  are  needed  for 
prediction.  After  that,  labels  for  the  test  file  (saved  as  ./data/partitions/X_test.data)  can  be  obtained  by  typing 
the  command 

python  std_predict.py 

which will create a file in ./predictions with the predicted labels, saved as prediction.data. The prediction error
can then be computed by comparison to the real labels (which are saved as ./data/partitions/y_test.data) by
typing the command:

python  test_error.py 

The same procedure can be repeated for Privileged SVM as

python partition.py --featFile ./data/priv.data
python SVMpriv.py
python priv_predict.py
python test_error.py

Finally,  knowledge  transfer  LUPI  SVM  can  be  executed  as  follows  (assuming,  for  this  example,  that  the  first 
14  features  in  the  dataset  are  standard,  while  other  features  are  privileged): 

python partition.py --featFile ./data/priv.data --lupi 14
python  SVMlupi.py 
python  lupi_predict.py 
python  test_error.py 

The  options  of  the  described  scripts  are  as  follows. 

partition.py 

•  featFile: path to the features file. This file should contain only the features, not the labels. If the file is
intended to be used with LUPI, then the standard features should appear first, followed by the
privileged ones. Default: ./data/std.data

•  labelFile: path to the label file. It is assumed that the n-th line contains the label of the n-th line of
features in the featFile. The labels are typically represented by 1 and 0, but in general any two integers
can be used instead. Default: ./data/labels.data

•  testSize: defines the integer size of the test set. For example, if the number 25 is passed, it will save
25 random data points for the test dataset and put the remaining data points into the training set.
Default allocates 25% of the data to the test set.

•  seed: random seed to decide the train/test split; it should be an integer. Default is 0.

•  outputFile: path to the directory in which the train/test datasets will be saved. Default is
./data/partitions

•  lupi: used to indicate that the data are intended for the LUPI approach, in which case the train set will
contain both standard and privileged features, but the test set will contain only standard ones. If used,
it expects an integer to indicate how many standard features there are (which are assumed to appear
first in the featFile). For example, if the number 5 is passed, it assumes that the features file has at least
6 columns and the first 5 are standard. Default is not to use the LUPI approach.
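The behavior of these options can be illustrated with a minimal sketch of a seeded train/test split in which, for LUPI, the test rows keep only the standard features. This is not the code of partition.py; the function `partition` and its argument names are hypothetical and only mirror the option descriptions above.

```python
import random


def partition(features, labels, test_size=None, seed=0, n_std=None):
    """Split rows into train and test sets.

    test_size: number of test rows (default: 25% of the data, rounded).
    n_std: if given, the first n_std columns are standard features and the
           test rows keep only those columns (LUPI setting).
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    n = len(features)
    if test_size is None:
        test_size = int(round(0.25 * n))
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, hence reproducible
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    x_train = [list(features[i]) for i in train_idx]
    x_test = [list(features[i]) for i in test_idx]
    if n_std is not None:
        # Privileged features are not available at test time.
        x_test = [row[:n_std] for row in x_test]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test
```

For example, `partition(rows, labels, test_size=25, n_std=14)` would correspond to passing a test size of 25 and `--lupi 14` in the description above.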

SVMstd.py 

•  nJobs: number of processors to be used. Expects a positive integer and assumes the machine has that
many processors. Default is 1.

•  seed: random seed to decide the cross-validation split; it should be an integer. Default is 0.

•  kernel: the choice of SVM kernel. It can be either 'linear' or 'rbf'. When passed as an argument, the
corresponding word has to be present, e.g. --kernel linear. Default is 'rbf'.

•  C_range: defines the range of values for the penalty parameter C of the SVM; C needs to be a positive
number. This option allows the user to pass 3 float numbers: minimum C, maximum C and step.
These will be used to construct the following interval: [2^(minC), 2^(maxC)] in steps of 2^(step), not
including 2^(maxC). To pass in the values, separate them by spaces, e.g. --C_range 1 2 0.1. Default:
-5 5.5 .5

•  gamma_range: defines the range of values for the gamma parameter of the SVM, when rbf kernel is
selected. This is the width of the kernel function. Typically one gives to the machine a range of values
that gamma can take and the machine does cross-validation to figure out which value has the best
generalization performance. This option allows the user to pass 3 float numbers: minimum gamma,
maximum gamma and step. These will be used to construct the following interval:
[1/(2*[2^(min_gamma)]^2), 1/(2*[2^(max_gamma)]^2)] in steps of 1/(2*[2^(step)]^2), not including
1/(2*[2^(max_gamma)]^2). To pass in the values, separate them by spaces, e.g. --gamma_range 1 2 0.1.
Default: -6 5.5 .5
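One possible reading of the C_range and gamma_range options is sketched below. It assumes that the three numbers are an exponent range (minimum, maximum excluded, step), that C = 2^e, and that gamma = 1/(2*sigma^2) with kernel width sigma = 2^e. This interpretation, and the function names, are assumptions made for illustration and have not been checked against the scripts themselves.

```python
def exponent_grid(minimum, maximum, step):
    """Exponents minimum, minimum + step, ... strictly below maximum."""
    count = 0
    while minimum + count * step < maximum - 1e-12:
        count += 1
    return [minimum + i * step for i in range(count)]


def c_values(min_c, max_c, step):
    """Candidate penalty parameters C = 2^e."""
    return [2.0 ** e for e in exponent_grid(min_c, max_c, step)]


def gamma_values(min_gamma, max_gamma, step):
    """Candidate gammas 1 / (2 * sigma^2) with sigma = 2^e (assumed reading)."""
    return [1.0 / (2.0 * (2.0 ** e) ** 2)
            for e in exponent_grid(min_gamma, max_gamma, step)]


# Defaults quoted in the option descriptions: -5 5.5 .5 and -6 5.5 .5
default_c = c_values(-5.0, 5.5, 0.5)
default_gamma = gamma_values(-6.0, 5.5, 0.5)
print(len(default_c), default_c[0], default_c[-1])  # 21 0.03125 32.0
print(len(default_gamma), default_gamma[0])         # 23 2048.0
```

Under this reading, the default C grid runs from 2^-5 to 2^5 inclusive, because the excluded upper bound is 2^5.5.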

SVMpriv.py  -  same  as  SVMstd.py 

SVMlupi.py  -  same  as  SVMstd.py  plus  the  following  options: 

•  lupiRegr: the kind of regression for reconstructing the privileged features at test time. This can be any
of 'linear', 'ridge', 'svr', where linear is standard linear regression, ridge is a kernel ridge non-linear
regression, and svr is a support vector regression (also non-linear). Both non-linear regressions need
to find optimal parameters during training and hence take a longer time to train. Both use an rbf kernel.
When passing an argument, the corresponding word has to be typed, e.g. --lupiRegr linear. Default:
ridge

•  nStdFeat:  number  of  standard  features  present,  it  expects  a  positive  integer.  Default:  14 

std_predict.py  -  saves  the  prediction  at  ./predictions/prediction.data 
priv_predict.py  -  same  as  std_predict.py 
lupi_predict.py  -  same  as  std_predict.py 




For the same MAMI set described above, we now describe how to compare LUPI SVM versus Standard
SVM. Using the contents of the folder, we can run Standard SVM and LUPI SVM side by side sufficiently
many times to achieve reliable results when we aggregate our findings. For this we provide 50 partitions of
the MAMI dataset and two scripts that automate the experiment. We would like to see the difference
between Standard and LUPI as we vary the size of the training sample. Hence we include in our folder 4
subfolders named mami80, mami160, mami320, and mami640. Each contains data partitioned into train and
test sets. The subfolder mami80 contains 50 random pairs of train/test sets, where the train sets have size 54
and the test sets have size 802. These numbers are determined by the maximum size of data with balanced
class labels if we start with 80 (or 160/320/640) random samples. The remaining data points are put in the
test set. Similarly, mami160 has a 108/748 split, mami320 has a 216/640 split, and mami640 has a 432/424
split.

These ready partitioned data allow us to run experiments and compare Standard and LUPI SVM for different
training sizes of 54, 108, 216 and 432 data points.
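The balanced partitions described above could be produced along the following lines. This is a sketch of one plausible reading (draw a random sample, keep equally many points of each class for training, send everything else to the test set), not the procedure actually used to create the provided folders; the function name is hypothetical.

```python
import random


def balanced_split(labels, n_draw, seed=0):
    """Draw n_draw random points, keep a class-balanced subset as the train
    set, and put every remaining point into the test set."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    by_class = {}
    for i in indices[:n_draw]:
        by_class.setdefault(labels[i], []).append(i)
    if len(by_class) < 2:
        raise ValueError("the drawn sample contains a single class")
    # The least frequent class in the sample limits the balanced train size.
    per_class = min(len(members) for members in by_class.values())
    train = sorted(i for members in by_class.values() for i in members[:per_class])
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return train, test
```

Under this reading, a train set of 54 points corresponds to 27 points per class, and the train and test sizes always add up to the size of the whole dataset (54 + 802 = 856 for mami80).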

To run these experiments, we used the mamiStd.py script as follows, to see the performance of the Standard
SVM algorithm:

python mamiStd.py --nJobs 2 --trainSize 80

where nJobs tells the computer to use 2 processors and trainSize tells it to run the experiment for size 52.
Similarly we can run the experiments for the rest of the train sizes. The output of the algorithm is the average
test error across the 50 experiments.

To run the corresponding experiments for LUPI, use the mamiLupi.py script as follows:

python mamiLupi.py --nJobs 2 --trainSize 80

where nJobs tells the computer to use 2 processors and trainSize tells it to run the experiment for size 52.
Similarly we can run the experiments for the rest of the train sizes. The output of the algorithm is the average
test error across the 50 experiments.

We have provided a special script that can automate the different steps of a full train/test cycle. The
experiment.py script can be used to run a number of experiments on any dataset and get the average test error.
Such repeated experiments are necessary in order to get a better approximation of the true test error for a
learner. Below we explain the options of the experiment script.

experiment.py

•  featFile: path to the features file. This should contain only the features, not the labels. If it is intended
to be used with LUPI, then the standard features should appear first, followed by the privileged ones.
Default: ./data/std.data

•  labelFile: path to the label file. It is assumed that the n-th line contains the label of the n-th line of
features in the featFile. The labels are typically represented by 1 and 0, but in general any two integers
can be used instead. Default: ./data/labels.data

•  testSize: defines the size of the test set, as an integer. If the number 25 is passed, it will save 25 random
data points for test and put the rest of the data points in the train set. Default: gives 25% of the data to
the test set

•  lupi: used to indicate that the data are intended for a LUPI procedure, in which case the train set will
contain standard and privileged features, but the test set will contain only standard ones. If used, it
expects an integer to indicate how many standard features there are (which are assumed to appear first
in the featFile). If the number 5 is passed, it is assumed that the features file has at least 6 columns and
the first 5 are standard. Default: assumes LUPI is not being done.

•  nJobs: number of processors to be used. Expects a positive integer and assumes the machine has that
many processors. Default: 1

•  kernel: the choice of SVM kernel. It can be either 'linear' or 'rbf'. When passed as an argument, the
corresponding word has to be typed, e.g. --kernel linear. Default: rbf

•  C_range: defines the range of values for the C parameter of the SVM. This is the tolerance of error
when constructing the classification boundary. C needs to be a positive number, but typically one
gives to the machine a range of values that C can take and the machine does cross-validation to figure
out which value has the best generalization performance. This option allows the user to pass 3 float
numbers: minimum C, maximum C and step. These will be used to construct the following interval:
[2^(minC), 2^(maxC)] in steps of 2^(step), not including 2^(maxC). To pass in the values, separate
them by spaces, e.g. --C_range 1 2 0.1. Default: -5 5.5 .5

•  gamma_range: defines the range of values for the gamma parameter of the SVM, when rbf kernel is
selected. This is the width of the kernel function. Typically one gives to the machine a range of values
that gamma can take and the machine does cross-validation to figure out which value has the best
generalization performance. This option allows the user to pass 3 float numbers: minimum gamma,
maximum gamma and step. These will be used to construct the following interval:
[1/(2*[2^(min_gamma)]^2), 1/(2*[2^(max_gamma)]^2)] in steps of 1/(2*[2^(step)]^2), not including
1/(2*[2^(max_gamma)]^2). To pass in the values, separate them by spaces, e.g. --gamma_range 1 2 0.1.
Default: -6 5.5 .5

•  lupiRegr: (used only together with the lupi option) the kind of regression for reconstructing the
privileged features at test time. This can be any of 'linear', 'ridge', 'svr', where linear is standard linear
regression, ridge is a kernel ridge non-linear regression, and svr is a support vector regression (also
non-linear). Both non-linear regressions need to find optimal parameters during training and hence take
a longer time to train. Both use an rbf kernel. When passed as an argument, the corresponding word has
to be typed, e.g. --lupiRegr linear. Default: ridge

Here we explain how to use this software on a different dataset. The most important thing is to save the data
in the appropriate format. Then the scripts have to be directed to the saved datasets. The format of the
data has to be the following:

Features file: a csv file that contains columns of data, where each column is a feature. If some of these
features are designated as Standard features, for LUPI purposes, then these Standard features should appear
first, followed by the rest of the features, which will be assumed to be privileged. For example, we have
provided two such files in the data folder. We have saved the whole mami dataset, with the first 14 features
being the standard features, in ./data/priv.data. We have also saved just the standard features as a separate
file in ./data/std.data

Labels file: a csv file that contains line-separated integers that represent the class labels of the examples
from the features file (in the same order).

Once the two files are saved, say features.data and labels.data, the partition.py script has to be pointed to
these two files to split them into train and test sets as follows:

python partition.py --featFile <path to features.data> --labelFile <path to labels.data>

More options can be included, such as the test size or whether the data are supposed to be used with LUPI (in
which case, the option --lupi <number of standard features> should be used). Once train and test partitions
are saved, SVMs can be trained on these data using the same instructions as before.


4  RESULTS  AND  DISCUSSIONS 

As described in the previous section, we proposed and developed a novel framework for the most accurate
computation of key statistical elements of model-driven problems (such as conditional probability, regression,
etc.). The proposed rigorous approach avoids the constraints of traditional model-driven approaches; instead,
the problem of computing the corresponding statistical quantities is formulated according to its definition,
which involves the corresponding Fredholm integral equation. This ill-posed equation, upon replacing
unknown quantities with their data-driven empirical approximations, is converted to the corresponding
quadratic programming problem, which can be solved with standard solvers. Additional domain
knowledge components can be encoded as constraints for these quadratic programming problems.

This  framework  was  developed  in  the  course  of  PPAML  program  and  described  in  detail  in  our  publications 
[1]  [2]  [3].  The  results,  described  in  these  papers,  demonstrate  the  viability  of  the  proposed  approach  and 
superior  performance  as  compared  to  classical  methods. 

For a narrower but very important class of problems, where the assumption of the monotonicity of the
results is essential (a typical classifier, such as SVM or ANN, satisfies this assumption), an additional Synergy
methodology was developed. This methodology is based on a rigorous approach of merging results of
different decisions (classifiers) in the most accurate way. It was developed in the course of PPAML program
and described in detail in [13]. It is applicable to ensemble methods (merging outputs of different classifiers),
bagging (merging outputs of classifiers trained on different subsets of the original training set), parameter
selection (merging outputs of classifiers trained on different areas of the parameter space), etc. Performance
improvements achieved by the proposed mechanisms reached up to 35% improvement of accuracy
over SoA (standard ensemble methods).

As described in the previous section, we developed a solid and scalable mechanism for capturing domain
knowledge in the form of features and kernels for standard data-driven problems. This mechanism was
implemented for learning using privileged information. The implemented techniques for encoding
model-based information into features improved performance by 40% over SoA (standard SVM and neural
networks). We have also developed a LUPI implementation in Python and released it as open source code.

More  details  about  the  outlined  methods  and  approaches  can  be  found  in  the  following  nine  publications  that 
were  published  based  on  results  of  our  research  in  DARPA  PPAML  program: 

(1)  V. Vapnik, I. Braga, R. Izmailov, Constructive Setting for Problems of Density Ratio Estimation,
Statistical Analysis and Data Mining, vol. 8, no. 3, June 2015, pp. 137-146.

(2)  V. Vapnik, R. Izmailov, Statistical Inference Problems and Their Rigorous Solutions, in Statistical
Learning and Data Sciences, A. Gammerman, V. Vovk, H. Papadopoulos (Eds). Lecture Notes in
Artificial Intelligence 9047. Proceedings of Third International Symposium, SLDS. London,
Springer, 2015, pp. 33-71.

(3)  V. Vapnik, R. Izmailov, V-Matrix Method of Solving Statistical Inference Problems, Journal of
Machine Learning Research, 16:1683-1730, 2015.

(4)  V. Vapnik, R. Izmailov, Synergy of Monotonic Rules, Journal of Machine Learning Research,
17:1-33, 2016.

(5)  V. Vapnik, R. Izmailov, Learning Using Privileged Information: Similarity Control and Knowledge
Transfer, Journal of Machine Learning Research, 16:2023-2049, 2015.

(6)  V. Vapnik, R. Izmailov, Learning with Intelligent Teacher: Similarity Control and Knowledge
Transfer, in Statistical Learning and Data Sciences, A. Gammerman, V. Vovk, H. Papadopoulos (Eds).
Lecture Notes in Artificial Intelligence 9047. Proceedings of Third International Symposium, SLDS.
London, Springer, 2015, pp. 3-32.

(7)  V. Vapnik, R. Izmailov, Learning with Intelligent Teacher, in Lecture Notes in Artificial Intelligence
9653. Proceedings of 5th International Symposium, COPA 2016. Springer, 2016.

(8)  R. Ilin, R. Izmailov, Y. Goncharov, S. Streltsov, Fusion of Privileged Features for Efficient Classifier
Training, 19th International Conference on Information Fusion, pp. 1-8, 2016.

(9)  V. Vapnik, R. Izmailov, Knowledge Transfer in SVM and Neural Networks, Annals of
Mathematics and Artificial Intelligence, 1-17, 2017.


5  CONCLUSIONS 

We  have  successfully  developed  two  techniques  that  complement  the  standard  machine  learning  approaches 
(model-driven  and  data-driven)  by  concentrating  on  the  drawbacks  of  each  approach  and  addressing  them  with 
the  advantages  of  the  complementary  one.  Specifically,  we  developed  novel  data-driven  techniques  for  improving 
the  model-driven  approach  and,  conversely,  novel  model-driven  techniques  for  improving  the  data-driven  approach. 
Both  developments  are  implemented  as  scalable  algorithms  and  are  published  in  nine  papers  in  academic  journals 
and  conferences. 